ID A1812
Submission Date : 11/06/2019
House price prediction is a very popular problem in data science competitions. This dataset has 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa. The competition challenges competitors to predict the final price of each home.
In this report my main focus is how an artificial neural network performs on this kind of problem and how to improve its prediction performance, so my elaboration on that section will be much more detailed. I have divided my work into four parts.
In the following section, I have imported all the necessary libraries that I will need to properly complete the assignment.
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from IPython.display import Image
from sklearn.preprocessing import normalize,MinMaxScaler
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
import seaborn as sns
# %matplotlib widget
%matplotlib inline
The following block of code reads the two CSV (Comma Separated Values) files and then stores the data inside them in two separate Dataframes named 'train' and 'test'.
train = pd.read_csv('train.csv')#.select_dtypes(exclude=['object'])
test = pd.read_csv('test.csv')#.select_dtypes(exclude=['object'])
#look into datatypes of the file
print("data types count")
train.dtypes.groupby(train.dtypes).count()
Here I print the first five entries of the train dataset to look at the actual data that I will be working with. It gives me some insight into the data.
print('show sample')
pd.set_option('display.max_column', None)
train.head()
Here I have used one of the built-in functions of pandas dataframe to display descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
print('description of data')
train.describe()
The following function shows a scatter plot and a distribution plot. I am going to use it to look at a few features of the dataset and observe how they change while I process the data. I will try not to remove data, so instead of removing a data point right away I will keep observing it until all my data processing is complete. If, after all the processing, some data points are still causing problems, I will drop them.
# Keep copies of the original data for showing the difference later
old_train_outlier_flag =train.copy()
old_test_outlier_flag =test.copy()
old_target_outlier_flag =train.SalePrice.copy()
# A FUNCTION THAT SHOWS SCATTER-PLOT AND DISTRIBUTION-PLOT
def outlier_check_plot(column, train_data_flag=train, test_data_flag=test, target=train.SalePrice):
    plt.subplots(figsize=(19, 5))
    # SCATTER PLOT OF THE 19 HIGHEST-VALUES OF A COLUMN
    plt.subplot(1, 3, 1)
    plt.scatter(x = train_data_flag[column].sort_values(ascending=False)[:19], y = train.Id[:19], color='red', label='Train' )
    plt.scatter(x = test_data_flag[column].sort_values(ascending=False)[:19], y = test.Id[:19], label='Test')
    plt.ylabel('Serial Number', fontsize=13)
    plt.xlabel(column, fontsize=13)
    plt.title('Fig 1: 19 highest-values of category {} \n in both train and test dataset'.format(column))
    plt.legend(loc='center',fontsize=13)
    # DISTRIBUTION-PLOT OF THE COLUMN
    plt.subplot(1, 3, 2)
    sns.distplot(train_data_flag[column],color='red', rug=True, hist=False, label='Train')
    sns.distplot(test_data_flag[column], rug=True, hist=False, label='Test')
    plt.ylabel('Distribution', fontsize=13)
    plt.xlabel(column, fontsize=13)
    plt.title('Fig 2: Distribution-plot of category {} \n for both train and test dataset'.format(column))
    plt.legend(fontsize=13)
    # SCATTER-PLOT OF THE COLUMN WITH RESPECT TO SALEPRICE
    plt.subplot(1, 3, 3)
    plt.scatter(x = train_data_flag[column], y = target)
    plt.ylabel('SalePrice', fontsize=13)
    plt.xlabel(column, fontsize=13)
    plt.title('Fig 3: Scatter-plot of train-category {} \n with respect to SalePrice'.format(column))
    plt.show()
print('Before outlier-removal of 1stFlrSF: ')
outlier_check_plot('1stFlrSF')
We can see one value in the train set that strongly contradicts SalePrice (1stFlrSF is very high but SalePrice is very low), and there is only one such high-value point in the test dataset. So we might want to remove this outlier.
print('Before outlier-removal of BsmtFinSF1: ')
outlier_check_plot('BsmtFinSF1')
We can also see the same outlier here.
print('Before outlier-removal of LotArea: ')
outlier_check_plot('LotArea')
We can see in Fig 3 that there are 4 LotArea train samples above 80000 that are very large in size but comparatively very low in SalePrice. Fig 1 also shows that there are no such values in the test data, so we can drop them.
print('Before outlier-removal of GrLivArea: ')
outlier_check_plot('GrLivArea')
If we compare Fig 3 with code-cell 13, we can see that two outliers are common to both plots. These two GrLivArea train samples are above 4000 with a very low SalePrice (below 300000). We are seeing the same outliers again and again.
print('Before outlier-removal of MasVnrArea: ')
outlier_check_plot('MasVnrArea')
As we can see in Fig 3, there is 1 MasVnrArea train sample above 1500 that is very large in size but comparatively low in SalePrice (below 300000), and Fig 1 shows no such value in the test data. This point does not show up as an outlier in the other features, so it is safe to keep it for now.
print('Before outlier-removal of LotFrontage: ')
outlier_check_plot('LotFrontage')
As we can see in Fig 3, there are LotFrontage train samples above 200 that are very large in size but comparatively low in SalePrice (below 300000), and there is no such value in the test data. One of them seems to be the common outlier, which is below 20000 (SalePrice). We should remove the common one and keep observing the other.
print('Before outlier-removal of TotalBsmtSF: ')
outlier_check_plot('TotalBsmtSF')
We can also see the common outlier and we would be removing the common outlier in the next section.
fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
There are a few houses with more than 4000 sq ft living area that are outliers, so we drop them from the training data.
train.drop(train[ (train["GrLivArea"] > 4000) ].index, inplace=True)
#Check the graph again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
It's a linear relationship, so this feature is helpful for predicting the price.
This relationship is also linear, so we can expect it to have a great impact on the price as well.
#scatter plot totalbsmtsf/saleprice
var = 'TotalBsmtSF'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
We have removed the common outlier and the graph now looks better. We will follow up after all the data preprocessing is done; if any outlier remains after all the processing, I will remove it.
#box plot overallqual/saleprice
import seaborn as sns
var = 'OverallQual'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
As expected saleprice increases when overall quality increases.
var = 'YearBuilt'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);
We can see that people tend to spend more on newly built houses. Although it does not seem to be a really strong feature according to the plot, it is important when we consider the other parameters too.
We have analysed four variables, but there are many others that we should analyse. The trick here seems to be the choice of the right features (feature selection) and not the definition of complex relationships between them (feature engineering).
The correlation coefficient is a statistical calculation that is used to examine the relationship between two sets of data. The value of the correlation coefficient tells us about the strength and the nature of the relationship.
Correlation coefficient values can range between +1.00 to -1.00. If the value is exactly +1.00, it means that there is a "perfect" positive relationship between two numbers, while a value of exactly -1.00 indicates a "perfect" negative relationship.
If correlation is Positive then the values increase together and if the correlation is Negative, one value decreases as the other increases. When two sets of data are strongly linked together we say they have a High Correlation.
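To make the definition concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The helper `pearson_r` and the toy numbers are made up for illustration and are not part of the competition data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values, invented for this example
area = [50, 80, 100, 120, 150]
price_up = [100, 160, 200, 240, 300]    # rises exactly in step with area
price_down = [300, 240, 200, 160, 100]  # falls exactly in step with area

print(pearson_r(area, price_up))    # perfect positive relationship
print(pearson_r(area, price_down))  # perfect negative relationship
```

For real data the value falls somewhere between these two extremes, and pandas computes the same quantity with `Series.corr`.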
#correlation matrix
corrmat = train.corr()
f, ax = plt.subplots(figsize=(15, 12))
sns.set(font_scale=1.25)
sns.heatmap(corrmat, vmax=.8, square=True);
In my opinion, this heatmap is the best way to get a quick overview of the relationships in a dataset.
At first sight, there are two red colored squares that get my attention. The first one refers to the 'TotalBsmtSF' and '1stFlrSF' variables, and the second one refers to the 'GarageX' variables. Both cases show how significant the correlation is between these variables. Actually, this correlation is so strong that it can indicate a situation of multicollinearity. If we think about these variables, we can conclude that they give almost the same information, so multicollinearity really occurs. Heatmaps are great for detecting this kind of situation, and in problems dominated by feature selection, like ours, they are an essential tool.
Another thing that got my attention was the 'SalePrice' correlations. We can see that our well-known 'GrLivArea', 'TotalBsmtSF', and 'OverallQual' are closely related to SalePrice, but we can also see many other variables that should be taken into account. So let us zoom in.
#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train[cols].values.T)
f, ax = plt.subplots(figsize=(15, 12))
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
'GarageCars' and 'GarageArea' are also among the most strongly correlated variables. The number of cars that fit into the garage is a consequence of the garage area, so 'GarageCars' and 'GarageArea' are really close. Therefore, we need just one of these variables in our analysis (we can keep 'GarageCars' since its correlation with 'SalePrice' is higher).
'TotalBsmtSF' and '1stFlrSF' also seem to be really close. We can keep 'TotalBsmtSF'.
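Reading strongly correlated pairs off a heatmap can also be done programmatically. The sketch below is one possible way to do it; the helper `correlated_pairs`, the 0.8 threshold and the toy frame are assumptions for illustration, not something used elsewhere in this notebook.

```python
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """List (col_a, col_b, corr) for numeric column pairs with |corr| above threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        # Only look at the upper triangle so each pair is reported once
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], round(float(value), 3)))
    return pairs

# Toy values, invented for this example
toy = pd.DataFrame({
    "GarageArea": [200, 400, 480, 600, 800],
    "GarageCars": [1, 2, 2, 3, 3],
    "YrSold": [2008, 2006, 2010, 2007, 2009],
})
pairs = correlated_pairs(toy)
print(pairs)
```

On the real data one would call `correlated_pairs(train)` and then decide, for each pair, which column to keep.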
Important questions when thinking about missing data:
The answer to these questions is important for practical reasons because missing data can imply a reduction of the sample size. This can prevent us from proceeding with the analysis. Moreover, from a substantive perspective, we need to ensure that the missing data process is not biased and hiding an inconvenient truth.
total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
I have tried a few approaches to data preprocessing. The current one was the best for all the models. Below are the steps I have taken to preprocess the data.
I have filled the missing values of some features with zero, because a missing value there means the feature does not exist in the house.
I have label encoded the features that contain ordinal values. Ordinal values are ranked labels along the lines of "Good", "Average", "Bad".
I have label encoded the object-type features that are not ordinal in nature.
I have also done some feature engineering, meaning I have created some new features from already existing features.
Label Encoding refers to converting the labels into numeric form so as to convert it into the machine-readable form. Machine learning algorithms can then decide in a better way on how those labels must be operated.
In this dataset, there are a lot of features that do not represent a quantitative value but are actually a label of some sort. For this particular dataset, almost all of the labeled values are strings (words); only a couple of the labels are represented with numbers. For example, let's check the feature 'Alley', which denotes the type of alley access to the property using the following labels. The meaning of each label is also given:
Grvl Gravel
Pave Paved
NA No alley access
In the real world, labels are in the form of words, because words are human readable. So it makes sense from that perspective. But when it comes to machine learning models, which work with numbers, we hit a bit of a roadblock. To remedy this, we need label encoding: the process of transforming the word labels into numerical form. This enables the algorithms to operate on data that have textual labels.
In case of the labels there are two distinct types, "nominal" and "ordinal". The terms "nominal" and "ordinal" refer to different types of categorizable data.
"Nominal" data assigns names to each data point without placing it in some sort of order. For example, the results of a test could be each classified nominally as a "pass" or "fail."
"Ordinal" data groups data according to some sort of ranking system: it orders the data. For example, this dataset has a very common ranking system which is as follows
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
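The difference between the two label types can be sketched as follows. The toy frame and the column names ending in `_code` are invented for this example; the ordinal mapping keeps the ranking, while the nominal codes are arbitrary identifiers with no order implied.

```python
import pandas as pd

# Toy rows, invented for this example
toy = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", "Ex", "Fa"],
    "Alley": ["Grvl", "Pave", None, "Grvl"],
})

# Ordinal: an explicit mapping preserves the ranking Po < Fa < TA < Gd < Ex
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
toy["KitchenQual_code"] = toy["KitchenQual"].map(quality_order)

# Nominal: any distinct integer per label will do; missing means "no alley access"
toy["Alley_code"] = toy["Alley"].fillna("NoAlley").astype("category").cat.codes

print(toy)
```

Here a larger `KitchenQual_code` really does mean better quality, whereas a larger `Alley_code` means nothing beyond "a different label".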
ntrain = train.shape[0]
ntest = test.shape[0]
target = train.SalePrice
all_data = pd.concat((train, test), sort=False).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))
The following grouping is used later, in the common data processing section, to impute missing data.
lot_frontage_by_neighborhood = all_data["LotFrontage"].groupby(all_data["Neighborhood"])
The following function will be used to convert categorical features to numbers.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
def factorize(df, factor_df, column, fill_na=None):
    # Copy the column, optionally fill missing values, then label encode it
    factor_df[column] = df[column]
    if fill_na is not None:
        factor_df[column].fillna(fill_na, inplace=True)
    le.fit(factor_df[column].unique())
    factor_df[column] = le.transform(factor_df[column])
    return factor_df
In the following section we look at the features that are most related to SalePrice. We should handle the most correlated features carefully, because they will contribute the most to our prediction.
# THE FEATURES MOSTLY CORRELATED TO SALEPRICE
cols = train.dtypes[train.dtypes != 'object'].index
corrs=[]
for item in cols:
    corrs.append((train[item].corr(train['SalePrice'])))
ist = pd.DataFrame(
    {'cols': cols,
     'corrs': corrs
    })
ist = ist.sort_values(by='corrs', ascending=False)
#ist.head()
plt.subplots(figsize=(19, 4))
sns.barplot(x=ist['cols'], y=ist['corrs'])
plt.xticks(rotation=90)
# The bars show correlation coefficients, not counts of unique values
plt.ylabel('Correlation with SalePrice', fontsize=13)
plt.xlabel('Column-Names', fontsize=13)
plt.title('The features most correlated with SalePrice')
plt.show()
All features of house-price dataset have been separated into 3 different categories: Type, Size and Period(Year/Month). Type or size based features represent type or size of the sample respectively. Type-based features are generally non-numeric and have very few categories.
def type_based_feature_analysis(column, rotation = None):
    order_all_data = all_data[column].value_counts().index
    order_train = train[column].value_counts().index
    if rotation is None: rotation = 90
    plt.subplots(figsize =(19, 4))
    plt.subplot(1, 5, 1)
    sns.barplot(x=train[column], y=train['SalePrice'], order = order_train)
    plt.xlabel(column+' (train)')
    plt.xticks(rotation=rotation)
    plt.subplot(1, 5, 2)
    sns.countplot(x=train[column], order = order_train)
    plt.xlabel(column+' (train)')
    plt.xticks(rotation=rotation)
    plt.subplot(1, 5, 3)
    sns.countplot(x=all_data[column], order = order_all_data)
    plt.xlabel(column+' (all_data)')
    plt.xticks(rotation=rotation)
    plt.subplot(1, 5, 4)
    sns.stripplot(x=train[column], y=train['SalePrice'], jitter = True, order = order_train)
    plt.xlabel(column+' (train)')
    plt.xticks(rotation=rotation)
    plt.subplot(1, 5, 5)
    sns.boxplot(x=train[column], y=train['SalePrice'], order = order_train)
    plt.xlabel(column+' (train)')
    plt.xticks(rotation=rotation)
    plt.show()
Size-based features describe an area in square feet; they are continuous numeric values.
from scipy.stats import pearsonr
def size_based_feature_analysis(column):
    grid = plt.GridSpec(3, 4)
    plt.subplots(figsize =(19, 7))
    zero = 0
    rotation = 90
    plt.subplot(grid[zero, 0])
    sns.barplot(x=train['HouseStyle'],y=train[column])
    plt.xticks(rotation=rotation)
    plt.subplot(grid[zero,1])
    sns.barplot(x=train['BldgType'],y=train[column])
    plt.xticks(rotation=rotation)
    plt.subplot(grid[zero, 2])
    sns.barplot(x=train['LotShape'],y=train[column])
    plt.xticks(rotation=rotation)
    plt.subplot(grid[zero, 3])
    # Scatter against SalePrice, with the Pearson correlation shown in the legend
    g = sns.regplot(x=train[column], y=train['SalePrice'], fit_reg=False,
                    label = "corr: %.2f"%(pearsonr(train[column], train['SalePrice'])[0]))
    g = g.legend(loc='best', fontsize=12)
    plt.xticks(rotation=rotation)
    plt.subplot(grid[2, 0:])
    sns.boxplot(x=train['Neighborhood'],y=train[column])
    plt.xticks(rotation=rotation)
    plt.show()
Period/Year based features show duration or periods.
def year_based_feature_analysis(column, rotation = None, box = True):
    if rotation is None: rotation = 90
    if(box == True):
        n = 3
        a = 2
        b = 3
        plt.subplots(figsize =(19, 16))
    else:
        n = 2
        b = 2
        plt.subplots(figsize =(19, 9))
    plt.subplot(n, 1, 1)
    sns.barplot(x=train[column], y=train['SalePrice'])
    plt.xlabel(column+' (train)')
    plt.xticks(rotation=rotation)
    if(box == True):
        plt.subplot(n, 1, a)
        sns.boxplot(x=train[column], y=train['SalePrice'])
        plt.xlabel(column+' (train)')
        plt.xticks(rotation=rotation)
    plt.subplot(n, 1, b)
    sns.stripplot(x=train[column], y=train['SalePrice'], jitter = True)
    plt.xlabel(column+' (train)')
    plt.xticks(rotation=rotation)
    plt.show()
Each of these 3 categories has its own pattern of values, so they are presented in 3 separate ways throughout the kernel. Let's have a look at the plots produced by these functions and analyze them one by one.
In this part we have label encoded some of the columns because some features are ordinal. I have replaced some null value with zero because in those case they probably meant that it may not exist . Finally I have merged some of the features to get a better feature.
Before starting the following block, it is important to understand what each feature means, so that describing my work is easier.
all_df = pd.DataFrame(index =all_data.index)
LotFrontage: linear feet of street connected to the property. This property of a house is most likely similar to that of the other houses in its neighbourhood, so let us group by neighbourhood and impute with the median (because of potential outliers).
size_based_feature_analysis("LotFrontage")
all_df["LotFrontage"] =all_data["LotFrontage"]
for key, group in lot_frontage_by_neighborhood:
    # Filling in missing LotFrontage values by the neighborhood median
    idx = (all_data["Neighborhood"] == key) & (all_data["LotFrontage"].isnull())
    all_df.loc[idx, "LotFrontage"] = group.median()
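The same neighborhood-median imputation can also be written without an explicit loop. The sketch below uses a toy frame with invented values; it is an alternative formulation of the idea, not the code this notebook runs.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for this example
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown"],
    "LotFrontage": [60.0, np.nan, 80.0, 50.0, np.nan],
})

# Median LotFrontage of each row's own neighborhood (NaN values are skipped)
neighborhood_median = toy.groupby("Neighborhood")["LotFrontage"].transform("median")

# Fill only the missing entries with that per-neighborhood median
toy["LotFrontage"] = toy["LotFrontage"].fillna(neighborhood_median)
print(toy)
```

The missing NAmes value becomes 70 (the median of 60 and 80) and the missing OldTown value becomes 50, which is what the loop would assign for the same input.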
LotArea is numeric and does not need to change, so I am keeping it as it is.
size_based_feature_analysis("LotArea")
all_df["LotArea"] =all_data["LotArea"]
MasVnrArea : NA most likely means no masonry veneer for these houses. We can fill 0 for the area and None for the type.
size_based_feature_analysis("MasVnrArea")
all_df["MasVnrArea"] =all_data["MasVnrArea"]
all_df["MasVnrArea"].fillna(0, inplace=True)
We can notice that some variables inherit the imputed value because the house does not have the object at all. For example, having no basement implies that NaN means 0 for the following four variables.
all_df["BsmtFinSF1"] =all_data["BsmtFinSF1"]
all_df["BsmtFinSF1"].fillna(0, inplace=True)
all_df["BsmtFinSF2"] =all_data["BsmtFinSF2"]
all_df["BsmtFinSF2"].fillna(0, inplace=True)
all_df["BsmtUnfSF"] =all_data["BsmtUnfSF"]
all_df["BsmtUnfSF"].fillna(0, inplace=True)
all_df["TotalBsmtSF"] =all_data["TotalBsmtSF"]
all_df["TotalBsmtSF"].fillna(0, inplace=True)
The following features do not have any null values, and all of them are living areas, so I am keeping them as they are.
size_based_feature_analysis('GrLivArea')
size_based_feature_analysis('2ndFlrSF')
size_based_feature_analysis('1stFlrSF')
all_df["1stFlrSF"] =all_data["1stFlrSF"]
all_df["2ndFlrSF"] =all_data["2ndFlrSF"]
all_df["GrLivArea"] =all_data["GrLivArea"]
GarageArea: Size of garage in square feet. NA means no garage available. so putting '0' to make it numeric.
size_based_feature_analysis('GarageArea')
all_df["GarageArea"] =all_data["GarageArea"]
all_df["GarageArea"].fillna(0, inplace=True)
The following features are numeric and have no nulls, so we can keep them as they are.
all_df["WoodDeckSF"] =all_data["WoodDeckSF"]
all_df["OpenPorchSF"] =all_data["OpenPorchSF"]
all_df["EnclosedPorch"] =all_data["EnclosedPorch"]
all_df["3SsnPorch"] =all_data["3SsnPorch"]
all_df["ScreenPorch"] =all_data["ScreenPorch"]
In the following features, null means no basement bathroom or no garage, so we can fill the nulls with zero.
all_df["BsmtFullBath"] =all_data["BsmtFullBath"]
all_df["BsmtFullBath"].fillna(0, inplace=True)
all_df["BsmtHalfBath"] =all_data["BsmtHalfBath"]
all_df["BsmtHalfBath"].fillna(0, inplace=True)
all_df["GarageCars"] =all_data["GarageCars"]
all_df["GarageCars"].fillna(0, inplace=True)
The following features do not have any missing values and are numerical, so we can keep them untouched.
type_based_feature_analysis('FullBath')
type_based_feature_analysis("HalfBath")
type_based_feature_analysis("BedroomAbvGr")
type_based_feature_analysis("KitchenAbvGr")
type_based_feature_analysis("TotRmsAbvGrd")
type_based_feature_analysis("Fireplaces")
type_based_feature_analysis("OverallQual")
type_based_feature_analysis("OverallCond")
all_df["FullBath"] =all_data["FullBath"]
all_df["HalfBath"] =all_data["HalfBath"]
all_df["BedroomAbvGr"] =all_data["BedroomAbvGr"]
all_df["KitchenAbvGr"] =all_data["KitchenAbvGr"]
all_df["TotRmsAbvGrd"] =all_data["TotRmsAbvGrd"]
all_df["Fireplaces"] =all_data["Fireplaces"]
all_df["OverallQual"] =all_data["OverallQual"]
all_df["OverallCond"] =all_data["OverallCond"]
This feature has only two options, yes or no, so converting it to binary will help.
type_based_feature_analysis('CentralAir')
all_df["CentralAir"] = (all_data["CentralAir"] == "Y") * 1.0
The following features are ordinal, so we perform label encoding here. The meaning of the levels is:
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Now we can easily convert them to integers from 1 (Poor) to 5 (Excellent), with 0 for a missing value.
In the following graphs we can see their relationship with SalePrice.
type_based_feature_analysis('BsmtQual')
type_based_feature_analysis('ExterQual')
type_based_feature_analysis('ExterCond')
type_based_feature_analysis('BsmtCond')
type_based_feature_analysis('HeatingQC')
type_based_feature_analysis('KitchenQual')
type_based_feature_analysis('FireplaceQu')
type_based_feature_analysis('GarageQual')
nan = float('nan')
qual_dict = {nan: 0, "NA": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
all_df["ExterQual"] =all_data["ExterQual"].map(qual_dict).astype(int)
all_df["ExterCond"] =all_data["ExterCond"].map(qual_dict).astype(int)
all_df["BsmtQual"] =all_data["BsmtQual"].map(qual_dict).astype(int)
all_df["BsmtCond"] =all_data["BsmtCond"].map(qual_dict).astype(int)
all_df["HeatingQC"] =all_data["HeatingQC"].map(qual_dict).astype(int)
all_df["KitchenQual"] =all_data["KitchenQual"].map(qual_dict).astype(int)
all_df["FireplaceQu"] =all_data["FireplaceQu"].map(qual_dict).astype(int)
all_df["GarageQual"] =all_data["GarageQual"].map(qual_dict).astype(int)
all_df["GarageCond"] =all_data["GarageCond"].map(qual_dict).astype(int)
I have converted a few more ordinal features to numerical features by label encoding. Let's look at their relation to SalePrice and how often each value occurs.
type_based_feature_analysis('BsmtExposure')
type_based_feature_analysis('BsmtFinType1')
type_based_feature_analysis('BsmtFinType2')
type_based_feature_analysis('Functional')
type_based_feature_analysis('GarageFinish')
type_based_feature_analysis('KitchenQual')
type_based_feature_analysis('Fence')
all_df["BsmtExposure"] =all_data["BsmtExposure"].map(
{nan: 0, "No": 1, "Mn": 2, "Av": 3, "Gd": 4}).astype(int)
bsmt_fin_dict = {nan: 0, "Unf": 1, "LwQ": 2, "Rec": 3, "BLQ": 4, "ALQ": 5, "GLQ": 6}
all_df["BsmtFinType1"] =all_data["BsmtFinType1"].map(bsmt_fin_dict).astype(int)
all_df["BsmtFinType2"] =all_data["BsmtFinType2"].map(bsmt_fin_dict).astype(int)
all_df["Functional"] =all_data["Functional"].map(
{nan: 0, "Sal": 1, "Sev": 2, "Maj2": 3, "Maj1": 4,
"Mod": 5, "Min2": 6, "Min1": 7, "Typ": 8}).astype(int)
all_df["GarageFinish"] =all_data["GarageFinish"].map(
{nan: 0, "Unf": 1, "RFn": 2, "Fin": 3}).astype(int)
all_df["Fence"] =all_data["Fence"].map(
{nan: 0, "MnWw": 1, "GdWo": 2, "MnPrv": 3, "GdPrv": 4}).astype(int)
PoolQC: NA means "No Pool"
type_based_feature_analysis('PoolQC')
all_df["PoolQC"] =all_data["PoolQC"].map(qual_dict).astype(int)
The following features are year-based, except MoSold, so let's see how they relate to SalePrice. Using the type-based graph for MoSold makes it easy to see in which months sales are higher and costlier.
year_based_feature_analysis('YearBuilt')
year_based_feature_analysis('YearRemodAdd')
year_based_feature_analysis('GarageYrBlt')
type_based_feature_analysis('MoSold')
year_based_feature_analysis('YrSold')
In the section above we can see that the year does not matter much for SalePrice, but in June and July more houses are sold and properties are costlier.
Now we only need to change GarageYrBlt, because it has null values in the dataset. If there is no GarageYrBlt, we can simply put 0 for the year.
all_df["YearBuilt"] =all_data["YearBuilt"]
all_df["YearRemodAdd"] =all_data["YearRemodAdd"]
all_df["GarageYrBlt"] =all_data["GarageYrBlt"]
all_df["GarageYrBlt"].fillna(0.0, inplace=True)
all_df["MoSold"] =all_data["MoSold"]
all_df["YrSold"] =all_data["YrSold"]
type_based_feature_analysis('LowQualFinSF')
type_based_feature_analysis('MiscVal')
It seems that MiscVal and LowQualFinSF are a little bit skewed. We will check them later when we adjust skewness; let's keep them as they are until then.
all_df["LowQualFinSF"] =all_data["LowQualFinSF"]
all_df["MiscVal"] =all_data["MiscVal"]
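As a small preview of what checking skewness can look like, the sketch below measures the skew of a right-skewed column before and after a `log1p` transform. The values are invented to resemble a mostly-zero column such as MiscVal, and the choice of `log1p` is an assumption for illustration rather than the adjustment this report necessarily applies.

```python
import numpy as np
import pandas as pd

# Toy values, invented for this example: mostly zero with a few large entries
misc_val = pd.Series([0, 0, 0, 0, 0, 400, 500, 2000, 15500], name="MiscVal")

skew_before = misc_val.skew()
# log1p(x) = log(1 + x) compresses large values and is defined at zero
skew_after = np.log1p(misc_val).skew()

print("skew before:", skew_before)
print("skew after :", skew_after)
```

A skew close to 0 indicates a roughly symmetric distribution, so a drop in the value after the transform suggests the feature has become easier for many models to work with.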
If pool area is not available or null then that means no pool is available
all_df["PoolQC"] =all_data["PoolQC"].map(qual_dict).astype(int)
all_df["PoolArea"] =all_data["PoolArea"]
all_df["PoolArea"].fillna(0, inplace=True)
In the following section we have categorical features. For example, our first column, MSSubClass, should actually be categorical, and possibly with some hierarchy as well (since it makes a difference whether it is 120 or 20).
NOTE: if we do not label encode numerical variables BEFORE we apply dummy encoding, then these variables will never be encoded, since dummy encoding works only on categorical variables.
Again, in the factorize function we first fill the null values and then label encode the column.
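The note about dummy encoding can be checked on a toy frame (values invented for this example). By default `pd.get_dummies` leaves numeric columns untouched, so a numeric code such as MSSubClass is only expanded once it has been turned into labels; casting to string is one way to do that, chosen here only for illustration.

```python
import pandas as pd

# Toy rows, invented for this example
toy = pd.DataFrame({
    "MSSubClass": [20, 60, 120, 20],
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
})

# Default behaviour: only the string column is expanded
dummies_default = pd.get_dummies(toy)

# After casting the numeric codes to strings, MSSubClass is expanded as well
toy_as_labels = toy.assign(MSSubClass=toy["MSSubClass"].astype(str))
dummies_labeled = pd.get_dummies(toy_as_labels)

print(list(dummies_default.columns))
print(list(dummies_labeled.columns))
```

In the first result MSSubClass is still a single numeric column; in the second it has become one indicator column per class.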
type_based_feature_analysis('MSSubClass')
type_based_feature_analysis('LotConfig')
type_based_feature_analysis('Neighborhood')
type_based_feature_analysis('Condition1')
type_based_feature_analysis('BldgType')
type_based_feature_analysis('HouseStyle')
type_based_feature_analysis('RoofStyle')
type_based_feature_analysis('Foundation')
type_based_feature_analysis('SaleCondition')
# Add categorical features as numbers too. It seems to help a bit.
all_df = factorize(all_data, all_df, "MSSubClass")
all_df = factorize(all_data, all_df, "LotConfig")
all_df = factorize(all_data, all_df, "Neighborhood")
all_df = factorize(all_data, all_df, "Condition1")
all_df = factorize(all_data, all_df, "BldgType")
all_df = factorize(all_data, all_df, "HouseStyle")
all_df = factorize(all_data, all_df, "RoofStyle")
all_df = factorize(all_data, all_df, "Foundation")
all_df = factorize(all_data, all_df, "SaleCondition")
type_based_feature_analysis('MSZoning')
For MSZoning: since "RL" is the most common values, we are going to use mode to impute it.
all_df = factorize(all_data, all_df, "MSZoning", "RL")
type_based_feature_analysis('Exterior1st')
type_based_feature_analysis('Exterior2nd')
type_based_feature_analysis('SaleType')
Exterior1st, Exterior2nd and SaleType each have just one missing value (all of them are strings!), so we simply fill it with the catch-all category ("Other" for the exterior features and "Oth" for SaleType).
all_df = factorize(all_data, all_df, "Exterior1st", "Other")
all_df = factorize(all_data, all_df, "Exterior2nd", "Other")
all_df = factorize(all_data, all_df, "SaleType", "Oth")
type_based_feature_analysis('MasVnrType')
MasVnrType: NA most likely means no masonry veneer for these houses. We can fill 0 for the area and None for the type.
all_df = factorize(all_data, all_df, "MasVnrType", "None")
In following code I am converting values of those features as 0 or 1
IR2 and IR3 don't appear that often, so just make a distinction between regular and irregular.
all_df["IsRegularLotShape"] = (all_data["LotShape"] == "Reg") * 1
Most properties are level; bin the other possibilities together as "not level".
all_df["IsLandLevel"] = (all_data["LandContour"] == "Lvl") * 1
type_based_feature_analysis('LandSlope')
Most land slopes are gentle; treat the others as "not gentle".
all_df["IsLandSlopeGentle"] = (all_data["LandSlope"] == "Gtl") * 1
Most properties use standard circuit breakers.
all_df["IsElectricalSBrkr"] = (all_data["Electrical"] == "SBrkr") * 1
About 2/3rd have an attached garage.
all_df["IsGarageDetached"] = (all_data["GarageType"] == "Detchd") * 1
type_based_feature_analysis('PavedDrive')
Most have a paved drive. Treat dirt/gravel and partial pavement as "not paved".
all_df["IsPavedDrive"] = (all_data["PavedDrive"] == "Y") * 1
The only interesting "misc. feature" is the presence of a shed.
all_df["HasShed"] = (all_data["MiscFeature"] == "Shed") * 1.
If YearRemodAdd != YearBuilt, then a remodeling took place at some point.
all_df["Remodeled"] = (all_df["YearRemodAdd"] != all_df["YearBuilt"]) * 1
Did a remodeling happen in the year the house was sold?
all_df["RecentRemodel"] = (all_df["YearRemodAdd"] == all_df["YrSold"]) * 1
Was this house sold in the year it was built?
all_df["VeryNewHouse"] = (all_df["YearBuilt"] == all_df["YrSold"]) * 1
Converting the following features similarly. Each flag is 1 when the house has the feature, that is, when the corresponding area is not zero.
all_df["Has2ndFloor"] = (all_df["2ndFlrSF"] != 0) * 1
all_df["HasMasVnr"] = (all_df["MasVnrArea"] != 0) * 1
all_df["HasWoodDeck"] = (all_df["WoodDeckSF"] != 0) * 1
all_df["HasOpenPorch"] = (all_df["OpenPorchSF"] != 0) * 1
all_df["HasEnclosedPorch"] = (all_df["EnclosedPorch"] != 0) * 1
all_df["Has3SsnPorch"] = (all_df["3SsnPorch"] != 0) * 1
all_df["HasScreenPorch"] = (all_df["ScreenPorch"] != 0) * 1
type_based_feature_analysis('MoSold')
type_based_feature_analysis('MSSubClass')
The following mappings were derived from the commented-out part of the code and from observing the graphs. Using a binary value instead of the fractional value helps generalization.
We can see that most sales happen in April, May, June and July. For MSSubClass, 20, 60 and 120 are the most costly classes, so we put 1 for them.
# Months with the largest number of deals may be significant.
# mx = max(train["MoSold"].groupby(train["MoSold"]).count())
# all_df["HighSeason"] =all_data["MoSold"].replace(
# train["MoSold"].groupby(train["MoSold"]).count()/mx)
# mx = max(train["MSSubClass"].groupby(train["MSSubClass"]).count())
# all_df["NewerDwelling"] =all_data["MSSubClass"].replace(
# train["MSSubClass"].groupby(train["MSSubClass"]).count()/mx)
all_df["HighSeason"] =all_data["MoSold"].replace(
{1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0})
all_df["NewerDwelling"] =all_data["MSSubClass"].replace(
{20: 1, 30: 0, 40: 0, 45: 0,50: 0, 60: 1, 70: 0, 75: 0, 80: 0, 85: 0,
90: 0, 120: 1, 150: 0, 160: 0, 180: 0, 190: 0})
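The commented-out lines count the deals per month, and the binary map keeps only the busiest months. As a rough sketch of how such a map could be derived from the counts instead of being typed by hand (the month values below are invented, and the 70% threshold is an arbitrary assumption, not the rule used in this report):

```python
import pandas as pd

# Invented month-of-sale values; the real column is train["MoSold"].
mo_sold = pd.Series([4, 5, 5, 6, 6, 6, 7, 7, 1, 12, 6, 5, 7, 4, 4])

counts = mo_sold.value_counts()
# Assumption for illustration: a month is "high season" if it reaches
# at least 70% of the busiest month's count.
busy_months = counts[counts >= 0.7 * counts.max()].index

# 1 for sales in a busy month, 0 otherwise.
high_season = mo_sold.isin(busy_months).astype(int)
print(sorted(busy_months.tolist()))
```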
year_based_feature_analysis('Neighborhood')
According to the graph, we mark the five most expensive neighborhoods as good places to live in (Neighborhood_Good = 1) and set all the others to zero.
all_df.loc[all_data.Neighborhood == 'NridgHt', "Neighborhood_Good"] = 1
all_df.loc[all_data.Neighborhood == 'Crawfor', "Neighborhood_Good"] = 1
all_df.loc[all_data.Neighborhood == 'StoneBr', "Neighborhood_Good"] = 1
all_df.loc[all_data.Neighborhood == 'Somerst', "Neighborhood_Good"] = 1
all_df.loc[all_data.Neighborhood == 'NoRidge', "Neighborhood_Good"] = 1
all_df["Neighborhood_Good"] = all_df["Neighborhood_Good"].fillna(0)
type_based_feature_analysis('SaleCondition')
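SaleCondition_PriceDown is 1 for sale conditions other than Normal and Partial (abnormal sale, allocation, adjoining land purchase, family sale) and 0 otherwise. BoughtOffPlan is 1 only for Partial sales, where the house was not yet completed when it was sold.
# Sale conditions other than Normal or Partial
all_df["SaleCondition_PriceDown"] =all_data.SaleCondition.replace(
{'Abnorml': 1, 'Alloca': 1, 'AdjLand': 1, 'Family': 1, 'Normal': 0, 'Partial': 0})
# House not yet completed at the time of sale (bought off plan)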
all_df["BoughtOffPlan"] =all_data.SaleCondition.replace(
{"Abnorml" : 0, "Alloca" : 0, "AdjLand" : 0, "Family" : 0, "Normal" : 0, "Partial" : 1})
all_df["BadHeating"] =all_data.HeatingQC.replace(
{'Ex': 0, 'Gd': 0, 'TA': 0, 'Fa': 1, 'Po': 1})
TotalArea is the sum of all area-related features of the property, and TotalArea1st2nd is the total area of the first and second floors. These features are added because their distributions are closer to normal and thus easier to work with.
area_cols = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF',
'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'LowQualFinSF', 'PoolArea' ]
all_df["TotalArea"] = all_df[area_cols].sum(axis=1)
all_df["TotalArea1st2nd"] = all_df["1stFlrSF"] + all_df["2ndFlrSF"]
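As a small illustration of the two ways the totals are built, the sketch below sums a few area columns row by row. The three houses are invented; only the column names come from the dataset.

```python
import pandas as pd

# Invented values for three houses; column names match the dataset.
toy = pd.DataFrame({
    "1stFlrSF": [800, 1200, 950],
    "2ndFlrSF": [0, 600, 400],
    "GarageArea": [300, 480, 0],
})

# sum(axis=1) adds across columns, producing one total per row (house).
toy["TotalArea"] = toy[["1stFlrSF", "2ndFlrSF", "GarageArea"]].sum(axis=1)
# Plain addition of two columns gives the same kind of per-row total.
toy["TotalArea1st2nd"] = toy["1stFlrSF"] + toy["2ndFlrSF"]
```

One difference worth knowing: `sum(axis=1)` skips missing values by default, while `+` propagates NaN, so the two only agree once missing areas have been filled.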
Price usually drops as a property gets older, so generating these age-related features might help.
all_df["Age"] = 2010 - all_df["YearBuilt"]
all_df["TimeSinceSold"] = 2010 - all_df["YrSold"]
all_df["SeasonSold"] = all_df["MoSold"].map({12:0, 1:0, 2:0, 3:1, 4:1, 5:1,
6:2, 7:2, 8:2, 9:3, 10:3, 11:3}).astype(int)
all_df["YearsSinceRemodel"] = all_df["YrSold"] - all_df["YearRemodAdd"]
In this section I have created some new features based on my reading of the graphs. After this simplification my Kaggle score improved a lot. Because this section relies on graph analysis only, no further explanation is given.
type_based_feature_analysis('OverallQual')
all_df["SimplOverallQual"] = all_df.OverallQual.replace(
{1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2, 6 : 2, 7 : 3, 8 : 3, 9 : 3, 10 : 3})
type_based_feature_analysis('OverallCond')
all_df["SimplOverallCond"] = all_df.OverallCond.replace(
{1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2, 6 : 2, 7 : 3, 8 : 3, 9 : 3, 10 : 3})
type_based_feature_analysis('PoolQC')
all_df["SimplPoolQC"] = all_df.PoolQC.replace(
{1 : 1, 2 : 1, 3 : 2, 4 : 2})
type_based_feature_analysis('GarageCond')
all_df["SimplGarageCond"] = all_df.GarageCond.replace(
{1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
type_based_feature_analysis('GarageQual')
all_df["SimplGarageQual"] = all_df.GarageQual.replace(
{1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
type_based_feature_analysis('FireplaceQu')
all_df["SimplFireplaceQu"] = all_df.FireplaceQu.replace(
{1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
type_based_feature_analysis('Functional')
all_df["SimplFunctional"] = all_df.Functional.replace(
{1 : 1, 2 : 1, 3 : 2, 4 : 2, 5 : 3, 6 : 3, 7 : 3, 8 : 4})
type_based_feature_analysis('KitchenQual')
all_df["SimplKitchenQual"] = all_df.KitchenQual.replace(
{1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
type_based_feature_analysis('HeatingQC')
all_df["SimplHeatingQC"] = all_df.HeatingQC.replace(
{1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
type_based_feature_analysis('BsmtFinType1')
all_df["SimplBsmtFinType1"] = all_df.BsmtFinType1.replace(
{1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2, 6 : 2})
type_based_feature_analysis('BsmtFinType2')
all_df["SimplBsmtFinType2"] = all_df.BsmtFinType2.replace(
{1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2, 6 : 2})
type_based_feature_analysis('BsmtCond')
all_df["SimplBsmtCond"] = all_df.BsmtCond.replace(
{1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
type_based_feature_analysis('BsmtQual')
all_df["SimplBsmtQual"] = all_df.BsmtQual.replace(
{1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
type_based_feature_analysis('ExterCond')
all_df["SimplExterCond"] = all_df.ExterCond.replace(
{1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
type_based_feature_analysis('ExterQual')
all_df["SimplExterQual"] = all_df.ExterQual.replace(
{1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
year_based_feature_analysis('Neighborhood')
Bin by neighborhood (a little arbitrarily). Values were computed by:
train_df["SalePrice"].groupby(train_df["Neighborhood"]).median().sort_values()
neighborhood_map = {
"MeadowV" : 0, # 88000
"IDOTRR" : 1, # 103000
"BrDale" : 1, # 106000
"OldTown" : 1, # 119000
"Edwards" : 1, # 119500
"BrkSide" : 1, # 124300
"Sawyer" : 1, # 135000
"Blueste" : 1, # 137500
"SWISU" : 2, # 139500
"NAmes" : 2, # 140000
"NPkVill" : 2, # 146000
"Mitchel" : 2, # 153500
"SawyerW" : 2, # 179900
"Gilbert" : 2, # 181000
"NWAmes" : 2, # 182900
"Blmngtn" : 2, # 191000
"CollgCr" : 2, # 197200
"ClearCr" : 3, # 200250
"Crawfor" : 3, # 200624
"Veenker" : 3, # 218000
"Somerst" : 3, # 225500
"Timber" : 3, # 228475
"StoneBr" : 4, # 278000
"NoRidge" : 4, # 290000
"NridgHt" : 4, # 315000
}
all_df["NeighborhoodBin"] =all_data["Neighborhood"].map(neighborhood_map)
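The medians quoted in the comments of neighborhood_map come from a group-by on the training data. A minimal sketch of that computation followed by the mapping step, using invented prices and a three-entry map (`toy` and `toy_map` are hypothetical names used only here):

```python
import pandas as pd

# Invented prices; the neighborhood codes are from the dataset.
toy = pd.DataFrame({
    "Neighborhood": ["MeadowV", "MeadowV", "NAmes", "NAmes", "NridgHt", "NridgHt"],
    "SalePrice": [80000, 96000, 130000, 150000, 300000, 330000],
})

# Median sale price per neighborhood, cheapest first.
medians = toy.groupby("Neighborhood")["SalePrice"].median().sort_values()

# Hand-written bins in the spirit of neighborhood_map (illustrative subset).
toy_map = {"MeadowV": 0, "NAmes": 2, "NridgHt": 4}
toy["NeighborhoodBin"] = toy["Neighborhood"].map(toy_map)
```

Note that `map` returns NaN for any neighborhood missing from the dictionary, so every value present in the data needs an entry.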
Filled missing values with 0 for features such as MasVnrArea, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, GarageArea, BsmtFullBath, BsmtHalfBath, GarageCars, PoolArea and GarageYrBlt. According to the documentation of the dataset, an empty field in these features means that the property does not have that feature, so this operation follows the documentation.
The CentralAir feature has only two values, 'Y' or 'N', so I have converted it to 0 or 1.
I have converted some features from categorical to numerical: MSSubClass, MSZoning, LotConfig, Neighborhood, Condition1, BldgType, HouseStyle, Exterior1st, Exterior2nd, MasVnrType, Foundation, SaleType and SaleCondition.
Converted some features to 0 or 1 based on my understanding of the dataset and a little research, creating simplified versions of existing features. For example, the LandSlope feature tells us what type of slope the property has. Even though it has multiple labels, it all comes down to whether the slope is gentle or not. Hence I have created a new feature called IsLandSlopeGentle, which tells us whether the slope is gentle (1) or not gentle (0). The other binary features, with the reason for each change, are given alongside the code above.
Simplified existing features into bad/average/good. Features: SimplOverallQual, SimplOverallCond, SimplPoolQC, SimplGarageCond, SimplGarageQual, SimplFireplaceQu, SimplFunctional, SimplKitchenQual, SimplHeatingQC, SimplBsmtFinType1, SimplBsmtFinType2, SimplBsmtCond, SimplBsmtQual, SimplExterCond, SimplExterQual.
Mapped each neighborhood to a bin from 0 to 4 according to its median sale price, using the neighborhood_map defined above. The number after the hash in each entry of that map is the median price of that neighborhood.
Keeping NeighborhoodBin into a temporary DataFrame because we want to use the unscaled version later on (to one-hot encode it).
# Keeping NeighborhoodBin into a temporary DataFrame because we want to use the
# unscaled version later on (to one-hot encode it).
neighborhood_bin = pd.DataFrame(index = all_df.index)
neighborhood_bin["NeighborhoodBin"] = all_df["NeighborhoodBin"]
According to Hair et al. (2013), four assumptions should be tested. Three of them are discussed here:
Normality - When we talk about normality, we mean that the data should look like a normal distribution. This is important because several statistical tests rely on it (e.g. t-statistics). In this exercise we'll just check univariate normality for 'SalePrice' (which is a limited approach). Remember that univariate normality doesn't ensure multivariate normality (which is what we would like to have), but it helps. Another detail to take into account is that in big samples (>200 observations) normality is not such an issue. However, if we solve normality, we avoid a lot of other problems (e.g. heteroscedasticity), so that's the main reason why we are doing this analysis.
Homoscedasticity - Homoscedasticity refers to the 'assumption that dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s)' (Hair et al., 2013). Homoscedasticity is desirable because we want the error term to be the same across all values of the independent variables.
Linearity - The most common way to assess linearity is to examine scatter plots and search for linear patterns. If patterns are not linear, it would be worthwhile to explore data transformations. However, we'll not get into this because most of the scatter plots we've seen appear to have linear relationships.
Standardization is the process of putting different variables on the same scale. This process allows you to compare scores between different types of variables. Typically, to standardize a variable, you calculate its mean and standard deviation. Then, for each observed value of the variable, you subtract the mean and divide by the standard deviation.
Skewness, in basic terms, means a lack of symmetry. With the help of skewness, one can identify the shape of the distribution of the data.
In the simplest cases, normalization means adjusting values measured on different scales to a notionally common scale, often prior to averaging. Some types of normalization involve only a rescaling, to arrive at values relative to some size variable.
We will reduce skewness with a log transformation and then scale all the numeric features using standardization (except SalePrice).
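The standardization step described above (subtract the mean, divide by the standard deviation) can be sketched in a few lines of NumPy. The numbers are invented, chosen so that the mean is 5 and the standard deviation is 2:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Subtract the mean and divide by the (population) standard deviation.
z = (x - x.mean()) / x.std()

print(z)
```

After this transformation the values have mean 0 and standard deviation 1. `x.std()` uses the population formula (dividing by n), which is also what scikit-learn's StandardScaler uses.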
In the following part we look at the skewness of the dataset, and we can see that many features are highly skewed. We will address this with a log transformation.
The log transformation is, arguably, the most popular among the different types of transformations used to transform skewed data to approximately conform to normality. If the original data follows a log-normal distribution or approximately so, then the log-transformed data follows a normal or near normal distribution.
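To illustrate the claim about log-normal data, the sketch below draws a synthetic log-normal sample and compares its skewness before and after taking the log. The sample, the seed and the helper `sample_skewness` are assumptions made for this illustration; the helper computes the third standardized moment, which is the default definition in `scipy.stats.skew`.

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment (biased estimator)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic log-normal data, not taken from the housing dataset.
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

skew_before = sample_skewness(raw)          # strongly positive
skew_after = sample_skewness(np.log(raw))   # close to zero

print(skew_before, skew_after)
```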
# keeping train and test data in a flag for comparison purpose
old_train_skewness_flag = all_df[:ntrain].copy()
old_test_skewness_flag = all_df[ntrain:].copy()
old_target_skewness_flag= train["SalePrice"].copy()
# old_target_skewness_flag
from scipy.stats import skew
numeric_features = all_df.dtypes[all_df.dtypes != "object"].index
skewness = all_df[numeric_features].skew(axis=0 , skipna =True)
skewness = pd.DataFrame(skewness)
plt.figure(figsize=[5,30])
# skw = sns.load_dataset(skewness)
ax = sns.barplot( y= skewness.index , x=skewness[0] , data = skewness)
plt.show()
Observation
Features such as WoodDeckSF contain zeros, for which the log is undefined, so we add 1 before taking the log, i.e. we compute log(1 + x) with np.log1p. Note: for real-valued input, log1p is accurate also for x so small that 1 + x == 1 in floating-point accuracy.
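A short check of the note about log1p, using a value small enough that 1 + x rounds to exactly 1 in double precision:

```python
import numpy as np

tiny = 1e-20
naive = np.log(1 + tiny)   # 1 + tiny rounds to exactly 1.0, so this is 0.0
stable = np.log1p(tiny)    # keeps the tiny value

zero_area = np.log1p(0.0)  # log(1 + 0) = 0, so zero-valued features stay at 0

print(naive, stable, zero_area)
```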
numeric_features = all_df.dtypes[all_df.dtypes != "object"].index
# Transform the skewed numeric features by taking log(feature + 1).
# This will make the features more normal.
from scipy.stats import skew
skewed = all_df[numeric_features].apply(lambda x: skew(x.dropna().astype(float)))
skewed = skewed[(skewed < -0.75) | (skewed > 0.75)]
skewed = skewed.index
all_df[skewed] = np.log1p(all_df[skewed])
# Additional processing: scale the data.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform(all_df[numeric_features])
for i, col in enumerate(numeric_features):
all_df[col] = scaled[:, i]
from scipy.stats import skew
numeric_features = all_df.dtypes[all_df.dtypes != "object"].index
skewness = all_df[numeric_features].skew(axis=0 , skipna =True)
skewness = pd.DataFrame(skewness)
plt.figure(figsize=[5,30])
# skw = sns.load_dataset(skewness)
ax = sns.barplot( y= skewness.index , x=skewness[0] , data = skewness)
plt.show()
We can see that skewness of the following features decreased a lot:
The skewness of the other numeric features also improved a little.
train_processed = all_df[:ntrain]
test_processed = all_df[ntrain:]
print("shape of train :" , train_processed.shape)
print("shape of test :" , test_processed.shape)
In this section we will observe how the distribution plots change after the log transformation and standardization of the numeric features. For each feature, the first row of plots shows the distribution before skewness removal, and the second row shows how it changes after skewness removal and standardization. Fig-2 is the distribution plot, so it should be observed carefully: it shows how much of the skewness is removed. Fig-3 shows the relation between SalePrice and the feature. If that relation is linear or close to linear, it will help us in training.
from IPython.display import Markdown, display
def printmd(string):
display(Markdown("***"+string+"***"))
printmd('Before skewness removal:')
outlier_check_plot('LotArea',old_train_skewness_flag, old_test_skewness_flag, old_target_skewness_flag)
printmd('After skewness removal:')
outlier_check_plot('LotArea' , train_processed, test_processed, old_target_skewness_flag)
printmd('Before skewness removal:')
outlier_check_plot('WoodDeckSF',old_train_skewness_flag, old_test_skewness_flag, old_target_skewness_flag)
printmd('After skewness removal:')
outlier_check_plot('WoodDeckSF', train_processed, test_processed, old_target_skewness_flag)
printmd('Before skewness removal:')
outlier_check_plot('OpenPorchSF',old_train_skewness_flag, old_test_skewness_flag, old_target_skewness_flag)
printmd('After skewness removal:')
outlier_check_plot('OpenPorchSF', train_processed, test_processed, old_target_skewness_flag)
printmd('Before skewness removal:')
outlier_check_plot('ExterCond',old_train_skewness_flag, old_test_skewness_flag, old_target_skewness_flag)
printmd('After skewness removal:')
outlier_check_plot('ExterCond', train_processed, test_processed, old_target_skewness_flag)
printmd('Before skewness removal:')
outlier_check_plot('MiscVal',old_train_skewness_flag, old_test_skewness_flag, old_target_skewness_flag)
printmd('After skewness removal:')
outlier_check_plot('MiscVal', train_processed, test_processed, old_target_skewness_flag)
printmd('Before skewness removal:')
outlier_check_plot('TotalArea',old_train_skewness_flag, old_test_skewness_flag, old_target_skewness_flag)
printmd('After skewness removal:')
outlier_check_plot('TotalArea', train_processed, test_processed, old_target_skewness_flag)
Most of the scatter plots now show a more linear relationship with SalePrice, and the distribution plots are less skewed and closer to a normal distribution. Finally, because of standardization all of the features are now on the same scale, which will also help training converge. We can see that the distributions improved a little due to the log transformation.
total = all_df.isnull().sum().sort_values(ascending=False)
percent = (all_df.isnull().sum()/all_df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1,
keys=['Total', 'Percent' ])
missing_data.head()
Now there is no missing data in either the train or the test dataset, so we can proceed further.
'SalePrice' is not normal. It shows 'peakedness', positive skewness and does not follow the diagonal line. But a simple data transformation can solve the problem.
from scipy.stats import norm
from scipy import stats
#histogram and normal probability plot
sns.distplot(target, fit=norm);
fig = plt.figure()
res = stats.probplot(target, plot=plt)
We take the log here because the error metric is between the log of the SalePrice and the log of the predicted price. That does mean we need to exp() the prediction to get an actual sale price.
temp_var = target.values.copy()
temp_var = np.log(temp_var)
target = pd.DataFrame(temp_var, columns=['SalePrice'])
# target["SalePrice"] = np.log(temp_var)
# train_processed.drop(["SalePrice"], axis=1, inplace=True)
print("Training set size:", train_processed.shape)
print("Test set size:", test_processed.shape)
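Because the model is trained on log(SalePrice), its predictions have to be passed through exp() to get prices back. A sketch of that round trip with invented prices and an invented prediction error (none of these numbers come from the dataset):

```python
import numpy as np

prices = np.array([100000.0, 200000.0, 350000.0])   # invented sale prices
log_prices = np.log(prices)                         # what the model is trained on

# Pretend the model predicts the log-price with a small error.
predicted_log = log_prices + np.array([0.05, -0.02, 0.0])
predicted_prices = np.exp(predicted_log)            # back to the price scale

# RMSE on the log scale penalises relative, not absolute, errors.
log_rmse = np.sqrt(np.mean((predicted_log - log_prices) ** 2))

print(predicted_prices, log_rmse)
```

An error of 0.05 on the log scale corresponds to roughly a 5% error in price, regardless of whether the house is cheap or expensive.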
Now we can see that the following graph looks normal and the probability plot follows the diagonal line.
print(train_processed.shape)
print(target.shape)
# print( test_processed.shape)
# print(train.shape)
from scipy.stats import norm
from scipy import stats
#histogram and normal probability plot
sns.distplot(target['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(target['SalePrice'], plot=plt)
train_processed = all_df[:ntrain]
test_processed = all_df[ntrain:]
print("shape of train :" , train_processed.shape)
print("shape of test :" , test_processed.shape)
In this section we check again whether any outliers remain after all the data processing. The distribution plots help us see the difference before and after normalization. Most of the features became closer to a normal distribution and less skewed after the processing, so we are not going to normalize them again.
from IPython.display import Markdown, display
def printmdmd(string):
display(Markdown("***"+string+"***"))
printmdmd('Before outlier-removal:')
outlier_check_plot('1stFlrSF',old_train_outlier_flag, old_test_outlier_flag, old_train_outlier_flag.SalePrice)
printmd('After outlier-removal:')
outlier_check_plot('1stFlrSF' , train_processed, test_processed, target.SalePrice)
printmd('Before outlier-removal:')
outlier_check_plot('BsmtFinSF1',old_train_outlier_flag, old_test_outlier_flag, old_target_outlier_flag)
printmd('After outlier-removal:')
outlier_check_plot('BsmtFinSF1', train_processed, test_processed, target.SalePrice)
printmd('Before outlier-removal:')
outlier_check_plot('LotArea',old_train_outlier_flag, old_test_outlier_flag, old_target_outlier_flag)
printmd('After outlier-removal:')
outlier_check_plot('LotArea', train_processed, test_processed, target.SalePrice)
printmd('Before outlier-removal:')
outlier_check_plot('GrLivArea',old_train_outlier_flag, old_test_outlier_flag, old_target_outlier_flag)
printmd('After outlier-removal:')
outlier_check_plot('GrLivArea', train_processed, test_processed, target.SalePrice)
printmd('Before outlier-removal:')
outlier_check_plot('MasVnrArea',old_train_outlier_flag, old_test_outlier_flag, old_target_outlier_flag)
printmd('After outlier-removal:')
outlier_check_plot('MasVnrArea', train_processed, test_processed, target.SalePrice)
printmd('Before outlier-removal:')
outlier_check_plot('TotalBsmtSF',old_train_outlier_flag, old_test_outlier_flag, old_target_outlier_flag)
printmd('After outlier-removal:')
outlier_check_plot('TotalBsmtSF', train_processed, test_processed, target.SalePrice)
Most of the scatter plots now show a linear relationship with SalePrice, and the distribution plots are less skewed and closer to a normal distribution. Finally, because of standardization all of the features are now on the same scale, which will also help training converge.
This time we can see that the distributions improved a little due to the log transformation. As I expected, the few outliers we observed earlier no longer appear to be outliers. Only the common outlier was the actual source of the problem. So we can now proceed to feed this data to our model.
print(train_processed.shape)
print(target.shape)
# print( test_processed.shape)
# print(train.shape)
abc = train_processed.copy()
abc['SalePrice'] = target.SalePrice.copy()
#correlation matrix
corrmat = abc.corr()
f, ax = plt.subplots(figsize=(15, 12))
sns.set(font_scale=1.25)
sns.heatmap(corrmat, vmax=.8, square=True);
We can see that the above graph is almost completely red, which means that most pairs of features are only weakly correlated with each other. Weak correlation does not prove that the features are independent, but it suggests that there is little redundancy among them. So our data processing should be good enough to get good results.
#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(abc[cols].values.T)
f, ax = plt.subplots(figsize=(15, 12))
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
We can see that GrLivArea and TotalArea1st2nd are very strongly correlated, and in the following graph their scatter plots also look the same. However, dropping one of them does not improve performance and sometimes decreases it, so I am keeping both of them.
The following 9 features are the most strongly correlated with SalePrice, and hence the most important for determining it. They also do not have any outliers.
# A FUNCTION TO SCATTER-PLOT ALL SELECTED FEATURES AGAINST SALEPRICE
def relation_with_SalePrice(c,column):
plt.subplot(5, 4, c)
plt.scatter(x = train_processed[column], y = target.SalePrice)
plt.xlabel(column)
c=1
sns.set(font_scale=1)
plt.subplots(figsize=(19, 19))
if 'SalePrice' in cols:
cols = cols.drop('SalePrice')
for item in cols:
relation_with_SalePrice(c,item)
c=c+1
plt.show()
In this section we split the training dataset into two parts. The first is the training set (X_train and y_train) and the other is the validation set (X_val and y_val). X denotes the features with SalePrice excluded, and y contains only SalePrice. I have used an 80-20 split, where the training set contains 80% of the data and the validation set contains 20%. I have used the Kaggle test set for testing (variable name test_processed), and the result of the Kaggle submission is included as a screenshot after the accuracy section.
X_train, X_val, y_train, y_val = train_test_split(train_processed,
target,
# train_size = 0.99,
test_size = 0.2,
random_state = 0,
shuffle = True
)
The following section changes the training set to 100% of the training data when we set submit=True. The reason is that training on the full dataset usually gives better accuracy. We set it to True only when we are going to submit the predictions for test_processed to Kaggle.
prediction_dict = dict()
submit_prediction_dict = dict()
submit = False
save_score = False
if submit:
    # Train on the full training set when preparing a Kaggle submission;
    # otherwise keep the 80% split created above.
    X_train = train_processed
    y_train = target
The following function calculates the root mean squared error.
What is RMSE?
The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) (or sometimes root-mean-squared error) is a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed. The RMSD represents the square root of the second sample moment of the differences between predicted values and observed values or the quadratic mean of these differences. These deviations are called residuals when the calculations are performed over the data sample that was used for estimation and are called errors (or prediction errors) when computed out-of-sample. The RMSD serves to aggregate the magnitudes of the errors in predictions for various times into a single measure of predictive power. RMSD is a measure of accuracy, to compare forecasting errors of different models for a particular dataset and not between datasets, as it is scale-dependent.
def rmse(y_true, y_pred):
return np.sqrt(mean_squared_error(y_true, y_pred))
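A small worked example of the same quantity, written with NumPy only so that the arithmetic is visible. `rmse_np` is a stand-in defined just for this illustration, and the values are invented:

```python
import numpy as np

def rmse_np(y_true, y_pred):
    """Square root of the mean of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Errors are 0, 0 and 2, so the mean squared error is 4/3.
value = rmse_np([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

print(value)
```

Because the errors are squared before averaging, the single error of 2 dominates the result; RMSE is more sensitive to large errors than the mean absolute error.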
A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
my_model = RandomForestRegressor(n_estimators=500,n_jobs=-1)
# y_train is a one-column DataFrame; ravel() passes it as a 1-D array.
my_model.fit(X_train, y_train.values.ravel())
prediction = my_model.predict(X_val)
if submit:
    submit_prediction = my_model.predict(test_processed)
    submit_prediction_dict['Random Forest Regressor'] = submit_prediction
prediction_dict['Random Forest Regressor'] = prediction
print('root mean squared error: ',rmse(y_val, prediction))
print('R^2 score: ', r2_score(np.array(y_val),prediction) )
Decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
from sklearn.tree import DecisionTreeRegressor
my_model = DecisionTreeRegressor()
my_model.fit(X_train, y_train)
prediction = my_model.predict(X_val)
prediction_dict['DecisionTree'] = prediction
if submit:
    submit_prediction = my_model.predict(test_processed)
    submit_prediction_dict['DecisionTree'] = submit_prediction
print('root mean squared error: ',rmse(y_val, prediction))
print('R^2 score: ', r2_score(np.array(y_val),prediction) )
XGBoost stands for eXtreme Gradient Boosting. It is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way.
from xgboost import XGBRegressor
my_model = XGBRegressor(n_estimators=500, learning_rate=0.05)
my_model.fit(X_train, y_train)
prediction = my_model.predict(X_val)
prediction_dict['Xgboost'] = prediction
if submit:
    submit_prediction = my_model.predict(test_processed)
    submit_prediction_dict['Xgboost'] = submit_prediction
print('root mean squared error: ',rmse(y_val, prediction))
print('R^2 score: ', r2_score(np.array(y_val),prediction) )
Lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. Lasso was originally formulated for least squares models, and this simple case reveals a substantial amount about the behavior of the estimator, including its relationship to ridge regression and best subset selection and the connections between lasso coefficient estimates and so-called soft thresholding. It also reveals that (like standard linear regression) the coefficient estimates need not be unique if covariates are collinear.
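The soft thresholding mentioned above can be sketched directly. In the special case of an orthonormal design, the lasso estimate is the least-squares coefficient shrunk toward zero by a fixed amount and set to exactly zero when it is smaller than that amount. The function name `soft_threshold`, the coefficients and the threshold are illustrative assumptions, not part of the model trained in this report:

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values inside [-lam, lam] become zero."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

coefs = np.array([3.0, -0.4, 0.1, -2.0])   # invented least-squares coefficients
shrunk = soft_threshold(coefs, lam=0.5)

print(shrunk)
```

The two small coefficients are set to exactly zero, which is how lasso performs variable selection, while the large ones are kept but shrunk.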
from sklearn.linear_model import Lasso
my_model = Lasso(alpha=5e-3, max_iter=50000)
my_model.fit(X_train, y_train)
prediction = my_model.predict(X_val)
prediction_dict['Lasso'] = prediction
if submit:
    submit_prediction = my_model.predict(test_processed)
    submit_prediction_dict['Lasso'] = submit_prediction
print('root mean squared error: ',rmse(y_val, prediction))
print('R^2 score: ', r2_score(np.array(y_val),prediction) )
In the above model, alpha is the constant that multiplies the L1 term. For numerical reasons we cannot set alpha to 0, but keeping alpha low gives good accuracy for our dataset. I have found that 5e-4 gives good accuracy. Validation results for other values of alpha:
alpha = 5e-5: root mean squared error: 0.10973737757187135, R^2 score: 0.9289433650407954
alpha = 1e-5: root mean squared error: 0.11426822609093419, R^2 score: 0.9229546464396043
alpha = 1e-3: root mean squared error: 0.10466883446067998, R^2 score: 0.9353556969018821
alpha = 1e-4: root mean squared error: 0.10658498063306822, R^2 score: 0.9329671780226085
alpha = 5e-3: root mean squared error: 0.10794617678311977, R^2 score: 0.9312440935471524
An Artificial Neural Network (ANN) is a computational model based on the structure and functions of biological neural networks. It works in a way loosely similar to how the human brain processes information. An ANN includes a large number of connected processing units that work together to process information and generate meaningful results from it.
An artificial neuron is a mathematical function conceived as a model of biological neurons, a neural network. Usually each input is separately weighted, and the sum is passed through a non-linear function known as an activation function or transfer function.
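A single artificial neuron as described above (weighted inputs, summed, then passed through an activation function) can be sketched in NumPy. The sigmoid activation, the inputs and the weights are choices made for this illustration only:

```python
import numpy as np

def sigmoid(t):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    return activation(np.dot(inputs, weights) + bias)

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 0.5])   # invented weights
out = neuron(x, w, bias=0.0)     # weighted sum is 0.5 - 2.0 + 1.5 = 0.0

print(out)
```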
An artificial neural network is typically organized in layers. Layers are made up of many interconnected 'nodes', each of which contains an 'activation function'. A neural network may contain the following 3 types of layers: an input layer, one or more hidden layers, and an output layer.
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=[10,5])
plt.scatter(range(len(train)),list(target.SalePrice.values))
plt.show()
plt.figure(figsize=[10,5])
sns.kdeplot(target.SalePrice, shade= True)
plt.show()
In the above graph we can see that the (log-transformed) sale price is approximately normally distributed. Initializing the weights with tf.random.normal may therefore be more helpful for training, and this initialization should reach a good validation score in fewer epochs. On Kaggle, I obtained an RMSE of 0.123 with random normal initialization, while uniform initialization gave an RMSE of 0.127. Uniform initialization also took about 3 times more epochs to reach an RMSE of 0.127. No further improvement was found after 16000 epochs for uniform initialization, or after 6000 epochs for normal initialization.
By observing the span of the data and its distribution, we can conclude that a linear regression model should perform well for this kind of problem. So starting with a single neuron in a single hidden layer should perform well, and we should look for a simpler solution first. From a theoretical perspective, a single-neuron, single-layer ANN with a linear output is nothing but a linear regression (with a sigmoid activation it would be a logistic regression). After adding layers and neurons, we can regularize them so that the network behaves more like a linear regression model, and then tune the parameters so that it can handle a little more complexity than a linear regression. Finally, my target is to make sure that the network performs as well as a linear regression model and then to improve it with more neurons/layers and proper tuning of parameters.
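To make the "single neuron as a regression model" idea concrete: one neuron with an identity activation, trained by gradient descent on the mean squared error, is a linear regression. The sketch below uses synthetic data and plain NumPy rather than TensorFlow; the data, learning rate and number of steps are all assumptions for illustration.

```python
import numpy as np

# Synthetic, noiseless data following y = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 1.0

w, b = 0.0, 0.0
learning_rate = 0.1
for _ in range(2000):
    pred = w * x + b                  # one neuron, identity activation
    error = pred - y
    # Gradients of the mean squared error with respect to w and b.
    w -= learning_rate * 2.0 * np.mean(error * x)
    b -= learning_rate * 2.0 * np.mean(error)

print(w, b)
```

The learned weight and bias recover the slope and intercept of the line, which is the baseline behavior a larger, regularized network is then expected to match or improve on.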
# log_df = pd.DataFrame(columns=['learning_rate', 'num_steps', 'beta1','beta2','beta3', 'hidden_1' , 'hidden_2', 'hidden_3','input_dim' , 'test_rmse_score', 'test_r2_score'])
# log_df.to_csv("diffrent_training_results.csv", index=False)
A brief explanation of the variables used is given below. Some terms are explained in more detail where they are used.
learning_rate: On an intuitive level, the learning rate means how fast the network learns something new and discards the old. On a technical level, it determines how large each update to the 'weights' is. The learning rate should be high enough that the model does not take too long to converge, and low enough that it can settle into a minimum.
num_steps (epochs): The number of times the model is trained on the data. After each run, the 'weights' are updated by the 'optimizer'.
beta1/2/3: These variables control how much L2 penalty the weight matrix of each hidden layer adds to the model's loss function.
hidden_1/2/3: Determines how many neurons a layer has. The number after 'hidden_' denotes the layer number, i.e. 2 means the second hidden layer.
input_dim: Determines the shape of the input matrix. The input size is the same as the number of features in the dataset.
output_dim: Determines the shape of the final output. As this is a regression problem, the output is of size one.
X_tf/y_tf: These two are TensorFlow placeholders. They receive the input data during training.
loss: For the loss function I have used mean squared error.
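The trade-off described for learning_rate can be seen on a toy problem. The sketch below is not the notebook's model: it minimises the made-up function f(w) = w**2 with plain gradient descent, and the function name `gradient_descent` and the three rates are assumptions chosen only to show slow convergence, fast convergence and divergence.

```python
def gradient_descent(learning_rate, steps=20, w0=5.0):
    """Minimise f(w) = w**2 (gradient 2*w) and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w
        w = w - learning_rate * grad
    return w


too_small = gradient_descent(0.01)  # moves towards 0, but slowly
suitable = gradient_descent(0.3)    # very close to the minimum at 0
too_large = gradient_descent(1.1)   # every step overshoots, w grows
print(too_small, suitable, too_large)
```

Each step multiplies w by (1 - 2 * learning_rate), so any rate above 1.0 makes this particular toy problem diverge. The right range for a real network has to be found experimentally.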
The following ANN is built with 3 hidden layers. The output dimension is 1 because it is a regression problem.
tf.reset_default_graph()
learning_rate = 0.1
num_steps = 8000
#for regularize weight matrix
beta1 = 0.1
beta2 = 0.0
beta3 = 0.0
beta4 = None
hidden_1 = 16
hidden_2 = 8
hidden_3 = 4
hidden_4 = None
# minimum_validation_loss is to control model saving locally
minimum_validation_loss = 0.0190000
input_dim = X_train.shape[1] # Number of features
output_dim = 1 # Because it is a regression problem
#tf graph input
X_tf = tf.placeholder("float" )
y_tf = tf.placeholder("float" )
A weight decides how much influence the input will have on the output. A weight represents the strength of the connection between units. When a value arrives at a neuron, the value is multiplied by the weight.
Bias is an extra input to neurons and has its own connection weight. A bias node is not connected to any node in the previous layer; it is connected only to the next layer. This makes sure that even when all the inputs are zero there can still be an activation in the neuron.
Here I have initialized the "weight" and "bias" variables as "random normal", which draws random values from a normal distribution. There is also the option to set them all to zero, but that causes a problem. If all of the weights are the same, all neurons in a layer will have the same error and receive the same update, so the model will not learn anything useful: there is no source of asymmetry between the neurons. That is why the better method is to keep the weights close to zero but make them different by initializing them to small, non-zero numbers. With default parameters, "random normal" draws values from a normal distribution with a mean of 0 (zero) and a standard deviation of 1 (one).
weights = {
    'w1': tf.Variable(tf.random_normal([input_dim, hidden_1])),
    'w2': tf.Variable(tf.random_normal([hidden_1, hidden_2])),
    'w3': tf.Variable(tf.random_normal([hidden_2, hidden_3])),
    'out': tf.Variable(tf.random_normal([hidden_3, output_dim]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([hidden_1])),
    'b2': tf.Variable(tf.random_normal([hidden_2])),
    'b3': tf.Variable(tf.random_normal([hidden_3])),
    'out': tf.Variable(tf.random_normal([output_dim]))
}
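The symmetry argument for avoiding identical initial weights can be checked by hand on a tiny network. The sketch below is a simplified stand-in for the model above: one hidden ReLU layer with two neurons, no biases, a squared-error loss, and made-up numbers. The function name `hidden_weight_gradients` is hypothetical.

```python
def hidden_weight_gradients(x, y, W, v):
    """Gradient of (out - y)**2 with respect to W.

    The network is out = sum_j v[j] * relu(sum_i x[i] * W[i][j]).
    """
    n_in, n_hidden = len(x), len(v)
    z = [sum(x[i] * W[i][j] for i in range(n_in)) for j in range(n_hidden)]
    h = [max(0.0, zj) for zj in z]
    out = sum(h[j] * v[j] for j in range(n_hidden))
    d_out = 2.0 * (out - y)
    return [[d_out * v[j] * (1.0 if z[j] > 0 else 0.0) * x[i]
             for j in range(n_hidden)]
            for i in range(n_in)]


x, y = [1.0, 2.0], 3.0

# Every weight identical: both hidden neurons get exactly the same gradient
same = hidden_weight_gradients(x, y, W=[[0.5, 0.5], [0.5, 0.5]], v=[0.5, 0.5])

# Different weights: the two neurons receive different gradients
different = hidden_weight_gradients(x, y, W=[[0.5, -0.2], [0.1, 0.4]], v=[0.3, 0.7])
print(same)
print(different)
```

In the first case the two columns of the gradient are equal, so the two neurons would stay identical after every update. Random initialization breaks that tie.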
The following block of code is what the actual ANN model looks like. In each layer, a matrix multiplication happens and then the result is passed through an activation function. The final output layer does not have any activation function because we are performing a regression here.
Here, the activation function is our main concern. Currently the most popular activation functions are as follows:
"Sigmoid" is represented mathematically by the equation f(x) = 1 / (1 + exp(-x)). Its output range is between 0 and 1 and it has an S-shaped curve. It is easy to understand and apply, but it has the "vanishing gradient" problem and is slow to converge. So I have avoided using it.
"Tanh" is represented mathematically by the equation f(x) = (1 - exp(-2x)) / (1 + exp(-2x)). Its output range is between -1 and 1, i.e. -1 < output < 1. Because its output is centered on zero, optimization is easier than with sigmoid, but it still suffers from the vanishing gradient problem.
"ReLU" is currently very popular due to its simplicity and ease of use. Mathematically, ReLU can be defined as follows:
R(x) = max(0, x), i.e. if x < 0, R(x) = 0 and if x >= 0, R(x) = x.
From the mathematical definition it can be seen that it is very simple and efficient. It also largely avoids the vanishing gradient problem for positive inputs, and it is relatively easy to optimize.
In the dataset the sale prices are non-negative numbers, so our model is expected to return positive values; this is one reason I have used ReLU as the activation function, since it gives non-negative values. ReLU is also easy to optimize because rectified linear units are similar to linear units. The only difference is that a rectified linear unit outputs zero across half its domain. Thus the derivatives through a rectified linear unit remain large whenever the unit is active. The gradients are not only large but also consistent.
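The three activation functions and their derivatives can be written out directly from the formulas above. The sketch below is plain Python, not the TensorFlow ops used in the notebook, and the sample inputs are arbitrary; it is meant only to show how the sigmoid and tanh gradients shrink for large inputs while the ReLU gradient stays at 1.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))


def relu(x):
    return max(0.0, x)


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    return 1.0 - tanh(x) ** 2


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


# Arbitrary sample inputs: the larger x gets, the smaller the
# sigmoid and tanh gradients become, while ReLU stays at 1.0
for x in (0.5, 5.0, 10.0):
    print(x, sigmoid_grad(x), tanh_grad(x), relu_grad(x))
```

When many such small gradients are multiplied together across layers during backpropagation, the product becomes tiny, which is the vanishing gradient problem mentioned above.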
def ann_model(X_input):
    # Hidden layers
    layer_1 = tf.add(tf.matmul(X_input, weights['w1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    layer_2 = tf.add(tf.matmul(layer_1, weights['w2']), biases['b2'])
    layer_2 = tf.nn.relu(layer_2)
    layer_3 = tf.add(tf.matmul(layer_2, weights['w3']), biases['b3'])
    layer_3 = tf.nn.relu(layer_3)
    # Output layer
    layer_out = tf.matmul(layer_3, weights['out']) + biases['out']
    return layer_out
For optimization I have used the Adam optimizer. The name Adam derives from the phrase "adaptive moments". It is a variant of RMSProp. I have used Adam instead of RMSProp for a couple of reasons. First, in Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient. The most straightforward way to add momentum to RMSProp is to apply momentum to the rescaled gradients, and the use of momentum in combination with rescaling does not have a clear theoretical motivation. Second, Adam includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments to account for their initialization at the origin. RMSProp also incorporates an estimate of the (uncentered) second-order moment; however, it lacks the correction factor. Thus, unlike in Adam, the RMSProp second-order moment estimate may have high bias early in training. Adam is generally regarded as being fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default. The usual default rate is 0.001, but in our case I have used 0.1 as it gives better optimization results.
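The two ingredients named above, the first-moment (momentum) estimate and the bias-corrected second-moment estimate, can be sketched for a single scalar parameter. This is a minimal illustration of the standard Adam update rule, not the internals of tf.train.AdamOptimizer; the function name `adam_minimize` and the toy objective f(w) = (w - 3)**2 are assumptions.

```python
import math


def adam_minimize(grad_fn, w, learning_rate=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1):
    """Run `steps` Adam updates on a single scalar parameter w."""
    m, v = 0.0, 0.0  # first and second moment estimates start at the origin
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g        # momentum term
        v = beta2 * v + (1.0 - beta2) * g * g    # uncentered second moment
        m_hat = m / (1.0 - beta1 ** t)           # bias corrections
        v_hat = v / (1.0 - beta2 ** t)
        w = w - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w


def toy_grad(w):
    """Gradient of the toy objective f(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)


w_one_step = adam_minimize(toy_grad, w=0.0, learning_rate=0.1, steps=1)
w_trained = adam_minimize(toy_grad, w=0.0, learning_rate=0.1, steps=500)
print(w_one_step, w_trained)
```

Thanks to the bias corrections, the very first step has a size of almost exactly the learning rate regardless of the gradient's scale. Without them both moment estimates would start close to zero and the early steps would be badly scaled.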
The following segment initializes the different parts of the model. From the dataset we can see that estimating the sale price is a regression problem, and the neural network used here was overfitting most of the time due to high variance. So, to make the model simpler, I have penalized the weight matrices of the hidden layers with L2 regularization. I have also found that a single hidden layer with a single neuron performs well, which means the prediction model does not need to be too complex. This convinced me that regularization was going to improve performance.
# Model Construct
model = ann_model(X_tf)
# Mean Squared Error function
# loss = tf.reduce_mean(tf.square(y_tf - model))
loss = tf.losses.mean_squared_error(y_tf , model , reduction=tf.losses.Reduction.SUM_BY_NONZERO_WEIGHTS)
# loss = tf.square(y_tf - model)
regularizer_1 = tf.nn.l2_loss(weights['w1'])
regularizer_2 = tf.nn.l2_loss(weights['w2'])
regularizer_3 = tf.nn.l2_loss(weights['w3'])
loss = tf.reduce_mean(loss + beta1*regularizer_1 + beta2*regularizer_2 + beta3*regularizer_3)
# loss = loss + beta1*regularizer_1 + beta2*regularizer_2 + beta3*regularizer_3
# Adam optimizer will update weights and biases after each step
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
# Initialize variables
init = tf.global_variables_initializer()
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
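To make the penalty term concrete: tf.nn.l2_loss returns half the sum of the squared entries of a tensor, and each hidden layer's value is scaled by its beta before being added to the mean squared error. The sketch below reproduces that arithmetic in plain Python with made-up weight matrices and a made-up MSE; the helper names `l2_loss` and `regularized_loss` are hypothetical.

```python
def l2_loss(matrix):
    """Half the sum of squared entries, the quantity tf.nn.l2_loss computes."""
    return sum(w * w for row in matrix for w in row) / 2.0


def regularized_loss(mse, weight_matrices, betas):
    """Data loss plus one beta-weighted L2 penalty per weight matrix."""
    penalty = sum(beta * l2_loss(W) for W, beta in zip(weight_matrices, betas))
    return mse + penalty


# Made-up weights and loss value, for illustration only
w1 = [[1.0, -2.0], [0.5, 0.0]]
w2 = [[3.0], [1.0]]

# beta of 0.0 switches the penalty off for the second layer,
# as beta2 and beta3 do in the configuration above
total = regularized_loss(mse=0.02, weight_matrices=[w1, w2], betas=[0.1, 0.0])
print(l2_loss(w1), total)
```

Because the penalty grows with the square of each weight, the optimizer is pushed towards small weights in the regularized layers, which is what makes the model behave more simply.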
training_block() is the function where the work finally happens. The constructed model receives the training data and the training process begins. In each epoch the code calculates the loss and tries to minimize it. The training loss and validation loss of each epoch are stored in train_loss and val_loss respectively. After every epoch the current loss values are appended to the two lists train_LC and val_LC (the check that would keep only every 50th epoch is commented out in the code), and these lists are used to plot the learning curve after training has finished. Also, after every 500 epochs, I print the training loss and the validation loss.
During the training phase, I run a shuffle function on the input data, so that the order in which the rows enter the model varies from epoch to epoch. The idea was to get some of the effect of training on mini-batches, although the whole training set is still fed to the optimizer in a single step.
train_LC = []
val_LC = []
# session_var = None
The train_LC and val_LC lists above keep track of the training and validation loss so that the learning curve can be drawn. In the following training block I have shuffled the training data in each epoch. In my runs this helped to reduce the difference between the validation loss and the training loss, and thus the chance of over-fitting and under-fitting.
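As a side note on the shuffling step, the sketch below shows what shuffling features and targets together does when the whole set is evaluated at once. It is plain Python with made-up data and a fixed, hypothetical model; `shuffle_together` only mimics the behaviour of sklearn.utils.shuffle on two lists.

```python
import random


def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def shuffle_together(features, targets, seed=0):
    """Apply the same random permutation to both lists."""
    pairs = list(zip(features, targets))
    random.Random(seed).shuffle(pairs)
    xs, ys = zip(*pairs)
    return list(xs), list(ys)


def model(x):
    """Hypothetical fixed model, y = 2x + 1."""
    return 2.0 * x + 1.0


# Made-up data
X = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.5, 2.5, 5.5, 6.0, 9.5]

X_s, y_s = shuffle_together(X, y)
loss_before = mse([model(x) for x in X], y)
loss_after = mse([model(x) for x in X_s], y_s)
print(loss_before, loss_after)
```

The rows stay paired and the full-batch mean squared error is the same before and after the shuffle (up to floating-point rounding). This is an observation about the arithmetic only, not a statement about the scores measured in this report; a row-order effect would be expected mainly once the data is split into mini-batches.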
def training_block(X_train,y_train, X_val,y_val):
    # minimum_validation_loss is shared with the notebook and updated here
    global minimum_validation_loss
    #reseting variables
    session_var = None
    save_path = None
    with tf.Session() as sess:
        #running initializer
        sess.run(init)
        # minimum_validation_loss = 0.0190000
        for i in range(num_steps):
            # when preparing a submission, train on the full training set
            if submit :
                X_train , y_train = shuffle(train_processed,target )
            else:
                X_train,y_train = shuffle(X_train,y_train )
            sess.run(optimizer, feed_dict={X_tf:X_train, y_tf:y_train})
            train_loss = sess.run(loss, feed_dict={X_tf:X_train, y_tf:y_train})
            val_loss = sess.run(loss, feed_dict={X_tf:X_val, y_tf:y_val})
            if submit :
                new_minimum_validation_loss = np.min(train_loss)
            else:
                new_minimum_validation_loss = np.min(val_loss)
            # if (i+1)%50 == 0:
            train_LC.append(train_loss)
            val_LC.append(val_loss)
            if (i+1)%500 == 0:
                print("epoch no : ",i+1, " training loss: ",train_loss, " validation loss: ", val_loss, " minimum_validation_loss" , minimum_validation_loss)
            # save a checkpoint every time the monitored loss improves
            if new_minimum_validation_loss < minimum_validation_loss :
                minimum_validation_loss = new_minimum_validation_loss
                # global session_var
                # session_var = sess
                # Save the variables to disk.
                save_path = saver.save(sess, "model/model.ckpt")
        if bool(save_path):
            sess.close()
            print("Model saved in path: %s" % save_path)
training_block(X_train,y_train, X_val,y_val)
In the above block I save the model whenever the validation loss is at its lowest. To do that I keep another parameter called minimum_validation_loss. When the validation loss drops below it, I save the model, update minimum_validation_loss and continue training. If a still lower validation loss is found, the model is saved again and minimum_validation_loss is updated again. Thus the checkpoint on disk always belongs to the lowest validation loss seen so far, which is the best result of the run. When I train on all the data to predict the Kaggle test dataset, I use the training loss in the same way.
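The save-on-improvement rule can be isolated from TensorFlow and sketched over a plain list of losses. The loss values below are made up, and `track_best` is a hypothetical helper that records the epochs at which the notebook's saver.save call would have been triggered.

```python
def track_best(losses, initial_threshold):
    """Return (best_loss, epochs_at_which_a_save_would_happen)."""
    best = initial_threshold
    saved_at = []
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            saved_at.append(epoch)  # the real code writes a checkpoint here
    return best, saved_at


# Made-up validation losses
val_losses = [0.030, 0.021, 0.018, 0.0185, 0.017, 0.019]
best, saved_at = track_best(val_losses, initial_threshold=0.019)
print(best, saved_at)

# If the loss never drops below the starting threshold, nothing is saved
never_best, never_saved = track_best([0.030, 0.025], initial_threshold=0.019)
print(never_best, never_saved)
```

The second call shows a consequence of starting from a fixed threshold such as 0.019: a run that never beats it leaves no new checkpoint, so a later restore would load whatever an earlier run saved.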
As I mentioned earlier, the epoch at which the best validation score is reached is not fixed; it falls into one of 3 different ranges of epochs. This is mostly because of the random initialization of the weights. If we fixed the seed value, the best epoch might fall into a single range, but by doing so we would lose the chance to improve the model further. Also, if we want to ensemble different ANN models, using the same seed and state would not help. I have tried 1000+ parameter combinations from the start and used graphs to see how to improve them. With a grid search I might not understand why certain settings give good results, and looking into every search result and graph is too much work. So selecting on the epoch (saving the model at its best validation loss) seems to me the more reasonable solution, because the epoch of the best validation result is different in every run.
I have shuffled the data in every epoch, and in my runs this trick improved the validation score. On the other hand, I did not use batches, because in my previous experience this kind of regression problem works better when the data is given as a whole set rather than in batches or mini-batches. If the model is overfitting, however, passing the data in batches or mini-batches may perform better, as it helps the model to generalize; we can say it acts somewhat like dropout. I have also tried dropout to reduce the gap between training and validation scores, but that did not work well.
def Prediction_block(X_val):
    with tf.Session() as sess:
        try:
            # Restore variables from disk.
            saver.restore(sess, "model/model.ckpt")
            print("Model restored.")
        except:
            print("------------ available checkpoint is for different model --------------")
            return
        # Check the values of the variables
        pred = sess.run(model, feed_dict={X_tf: X_val})
        prediction = pred.squeeze()
        sess.close()
    return prediction
# print(np.exp(prediction))
prediction = Prediction_block(X_val)
pred_str = 'ANN_base_lr'+str(learning_rate)+'_beta'+str(beta1)+'-'+str(beta2)+'-'+str(beta3)+'-'+str(beta4)+'_hidden'+str(hidden_1)+'-'+str(hidden_2)+'-'+str(hidden_3)+'-'+str(hidden_4)
prediction_dict[pred_str] = prediction
if submit:
    submit_prediction = Prediction_block(test_processed)
    submit_prediction_dict[pred_str] = submit_prediction
The following function plots the learning curve. The two observation flag variables are only used to zoom into the graph.
def learning_curve(start_observation_flag,end_observation_flag):
    xdata = list(range(1,len(train_LC)+1))
    minimum = min(train_LC)
    plt.figure(figsize=[20,5])
    plt.plot(xdata, train_LC, 'b--', label='Training curve')
    plt.annotate('train min', xy=(xdata[train_LC.index(minimum)], minimum),
                 arrowprops=dict(facecolor='black', shrink=0.05))
    minimum = min(val_LC)
    plt.plot(xdata, val_LC, 'r--' , label='Validation curve')
    plt.annotate('vali min', xy=(xdata[val_LC.index(minimum)], minimum),
                 arrowprops=dict(facecolor='red', shrink=0.05))
    plt.legend()
    plt.show()
    print("If we zoom into the curve we would have seen the following")
    plt.figure(figsize=[20,5])
    plt.plot(xdata[start_observation_flag:end_observation_flag], train_LC[start_observation_flag:end_observation_flag], 'b--')
    plt.plot(xdata[start_observation_flag:end_observation_flag], val_LC[start_observation_flag:end_observation_flag],'r--')
    plt.show()
#Following variables are only used to zoom into the graph
# max(0, ...) keeps the slice start from going negative when the minimum occurs early
start_observation_flag = max(0, train_LC.index( min(train_LC)) - 300)
end_observation_flag = train_LC.index( min(train_LC)) + 100
learning_curve(start_observation_flag,end_observation_flag)
def plot_prediction(y_val, prediction_val):
    plt.figure(figsize=[5,5])
    plt.title('Compare predicted value VS real value')
    sns.regplot(x= np.exp(y_val), y = np.exp(prediction_val), fit_reg=False)
    sns.regplot(x=np.array([10,800000]), y=np.array([10,800000]),fit_reg=True)
    plt.show()
plot_prediction( y_val, pred_df['ANN_base_lr0.1_beta0.1-0.0-0.0-None_hidden16-8-4-None'])
Both of the curves appear to lie on top of each other. The reason is as follows.
For the loss function I have used Mean Squared Error (MSE). For reducing the MSE I have used SUM_BY_NONZERO_WEIGHTS, which divides the scalar sum by the number of non-zero weights. MSE calculates the squared error for every sample and then takes the mean. Now, all my SalePrice values are very small due to normalization (log-transformed, between 10 and 13.5), and the mean SalePrice is 12.02. Suppose that in the nth epoch the squared errors of five samples are as follows:
for training loss
MSE = (.25+1+.25+.25+1.7)/5 = .69
for validation loss
MSE = (.81+1.69+1.44+.49+.04)/5 = .894
The difference between the validation loss and the training loss is .204
Usually, in a regression problem, a neural network starts to predict the average value within 5-20 epochs, so the difference between val_loss and training_loss becomes much smaller very quickly. In our dummy example the difference is already only .204; if that is epoch number 10, then by the time training reaches epoch 500 the difference could be as low as 10^-4.
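The arithmetic of the dummy example can be checked in a few lines. The lists below simply hold the squared errors quoted in the text; the variable names are mine.

```python
# Squared errors of the five dummy samples quoted above
train_sq_err = [0.25, 1.0, 0.25, 0.25, 1.7]
val_sq_err = [0.81, 1.69, 1.44, 0.49, 0.04]

train_mse = sum(train_sq_err) / len(train_sq_err)
val_mse = sum(val_sq_err) / len(val_sq_err)
gap = val_mse - train_mse

print(round(train_mse, 3), round(val_mse, 3), round(gap, 3))
```

Since the targets sit in a narrow band around 12, even a rough prediction gives small squared errors, so both losses and their difference are small numbers that are hard to tell apart on a plot of the full training run.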
def accuracy(y_val,prediction):
    # returns the RMSE and the R2 score of the predictions
    test_rmse_score = rmse(y_val, prediction)
    test_r2_score = r2_score(np.array(y_val),prediction)
    return test_rmse_score, test_r2_score
test_rmse_score, test_r2_score = accuracy(y_val,prediction)
print('ANN root mean squared error: ', test_rmse_score)
print('R2 score: ', test_r2_score )
On the Kaggle leaderboard, the above ANN model gives my best RMSE score, which is 0.11912.
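For reference, the two metrics reported above can be computed by hand. The sketch below is a plain-Python illustration with made-up values on the same log scale as the target; `rmse_sketch` and `r2_sketch` are hypothetical names, not the `rmse` helper or sklearn's `r2_score` used in the notebook, although they follow the same standard definitions.

```python
import math


def rmse_sketch(y_true, y_pred):
    """Root mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up log-scale prices and predictions
y_true = [11.5, 12.0, 12.5, 13.0]
y_pred = [11.6, 11.9, 12.4, 13.2]

print(rmse_sketch(y_true, y_pred), r2_sketch(y_true, y_pred))
```

RMSE is in the units of the (log) target, so lower is better, while R2 measures the share of the target's variance that the predictions explain, with 1.0 being a perfect fit.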
From the learning curve we can observe where overfitting occurs. Overfitting shows up when the training loss drops well below the validation loss, even if the validation loss is still falling. It is a sign that the network is learning patterns in the training set that do not apply to the validation set. In short, we can say:
Overfitting : training loss << validation loss
Underfitting : training loss >> validation loss
Just right : training loss ~ validation loss
According to this rule of thumb, our two learning curves lie almost exactly on top of one another, i.e. the validation loss and the training loss are almost the same, so we can say that our model is fitting just right. The validation score of .1054 is also impressive compared to the other models.
if save_score:
    log_df = pd.read_csv("diffrent_training_results.csv")
    log_df = log_df.append({'learning_rate' : learning_rate, 'num_steps' : num_steps, 'beta1' : beta1, 'beta2' : beta2, 'beta3' : beta3, 'beta4' : beta4, 'hidden_1' : hidden_1 , 'hidden_2' : hidden_2, 'hidden_3' : hidden_3, 'hidden_4' : hidden_4, 'input_dim' : input_dim , 'test_rmse_score' : test_rmse_score , 'test_r2_score' : test_r2_score}, ignore_index=True)
    log_df.to_csv("diffrent_training_results.csv", encoding='utf-8',index=False)
When we perform a random train-test split of our data, we assume that our examples are independent. That means that seeing some instances will not help us understand other instances. However, that is not always the case. So, to check whether the data is actually independent, to get more metrics, and to fine-tune my parameters on the whole dataset, I am performing cross validation.
from sklearn.model_selection import KFold
from sklearn.model_selection import RepeatedKFold
kf = KFold(n_splits=10, shuffle=True)
kf_rmse_list = []
kf_r2_list = []
# train_processed['SalePrice'] = target.values
for train_index, test_index in kf.split(train_processed):
    X_train, X_val = train_processed.iloc[train_index] , train_processed.iloc[test_index]
    y_train, y_val = target.iloc[train_index], target.iloc[test_index]
    training_block(X_train,y_train, X_val,y_val)
    prediction = Prediction_block(X_val)
    test_rmse_score, test_r2_score = accuracy(y_val, prediction)
    kf_rmse_list.append(test_rmse_score)
    kf_r2_list.append(test_r2_score)
print("r2 list print", kf_r2_list)
print('rmse list print',kf_rmse_list)
print("r2 mean print", np.mean(kf_r2_list))
print('rmse mean print', np.mean(kf_rmse_list))
In the cross validation section we can see that 10-fold cross validation of our best ANN model gives an RMSE similar to that of the 80-20 split. So we can rely on the 80-20 split for this dataset, and the result suggests that the examples in the dataset are independent.
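How a k-fold split partitions the rows can be sketched without scikit-learn. The generator below is a simplified stand-in and not sklearn's exact KFold algorithm: the name `k_fold_indices`, the seed and the sample count are assumptions made for the illustration.

```python
import random


def k_fold_indices(n_samples, n_splits, seed=0):
    """Yield (train_indices, validation_indices) for each of n_splits folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds of near-equal size
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx


splits = list(k_fold_indices(n_samples=25, n_splits=10))
for train_idx, val_idx in splits[:2]:
    print(len(train_idx), len(val_idx))
```

Every row is used for validation exactly once and for training in the other folds, which is why the averaged score depends less on one lucky or unlucky 80-20 split.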
In this section we look at a few other models and their learning curves. Some of them will later be used in the ensemble learning section for further improvement. In these models I have only changed the number of hidden layers, the number of neurons in each hidden layer, the number of steps and the learning rate. The rest is the same as in the ANN described above.
tf.reset_default_graph()
def weight_bais():
    global weights, biases
    weights = None
    biases = None
    weights = {
        'w1': tf.Variable(tf.random_normal([input_dim, hidden_1])),
        'w2': tf.Variable(tf.random_normal([hidden_1, hidden_2])),
        'w3': tf.Variable(tf.random_normal([hidden_2, hidden_3])),
        'w4': tf.Variable(tf.random_normal([hidden_3, hidden_4])),
        'out': tf.Variable(tf.random_normal([hidden_4, output_dim]))
    }
    biases = {
        'b1': tf.Variable(tf.random_normal([hidden_1])),
        'b2': tf.Variable(tf.random_normal([hidden_2])),
        'b3': tf.Variable(tf.random_normal([hidden_3])),
        'b4': tf.Variable(tf.random_normal([hidden_4])),
        'out': tf.Variable(tf.random_normal([output_dim]))
    }
def ann_model(X_input):
    # Hidden layers
    layer_1 = tf.add(tf.matmul(X_input, weights['w1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    layer_2 = tf.add(tf.matmul(layer_1, weights['w2']), biases['b2'])
    layer_2 = tf.nn.relu(layer_2)
    layer_3 = tf.add(tf.matmul(layer_2, weights['w3']), biases['b3'])
    layer_3 = tf.nn.relu(layer_3)
    layer_4 = tf.add(tf.matmul(layer_3, weights['w4']), biases['b4'])
    layer_4 = tf.nn.relu(layer_4)
    # Output layer
    # layer_out = tf.add(tf.matmul(layer_4, weights['out']), biases['out'])
    layer_out = tf.matmul(layer_4, weights['out']) + biases['out']
    return layer_out
regularizer_4 = None
def miscellaneous_initialization():
    global model, loss , regularizer_1 , regularizer_2 ,regularizer_3, regularizer_4, optimizer , init , saver
    # Model Construct
    model = ann_model(X_tf)
    # Mean Squared Error loss function
    loss = tf.losses.mean_squared_error(y_tf , model , reduction=tf.losses.Reduction.SUM_BY_NONZERO_WEIGHTS)
    # loss = tf.square(y_tf - model)
    regularizer_1 = tf.nn.l2_loss(weights['w1'])
    regularizer_2 = tf.nn.l2_loss(weights['w2'])
    regularizer_3 = tf.nn.l2_loss(weights['w3'])
    regularizer_4 = tf.nn.l2_loss(weights['w4'])
    # loss = tf.reduce_mean(loss + beta1*regularizer_1 + beta2*regularizer_2 + beta3*regularizer_3)
    loss = tf.reduce_mean(loss + beta1*regularizer_1 + beta2*regularizer_2 + beta3*regularizer_3 + beta4*regularizer_4)
    # Adam optimizer will update weights and biases after each step
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
    # Initialize variables
    init = tf.global_variables_initializer()
    # Add ops to save and restore all the variables.
    saver = tf.train.Saver()
The variables changed in this section are:
| Layer | Neurons | Beta for L2 regularization |
|---|---|---|
| 1st hidden layer | 76 | 0.1 |
| 2nd hidden layer | 48 | 0.05 |
| 3rd hidden layer | 32 | 0 |
| 4th hidden layer | 16 | 0 |
tf.reset_default_graph()
learning_rate = 0.1
num_steps = 25000
#for regularize weight matrix
beta1 = 0.1
beta2 = 0.05
beta3 = 0.0
beta4 = 0.0
hidden_1 = 76
hidden_2 = 48
hidden_3 = 32
hidden_4 = 16
minimum_validation_loss = .02101000
input_dim = X_train.shape[1] # Number of features
output_dim = 1 # Because it is a regression problem
#tf graph input
X_tf = tf.placeholder("float" )
y_tf = tf.placeholder("float" )
weight_bais()
miscellaneous_initialization()
train_LC = []
val_LC = []
training_block(X_train,y_train, X_val,y_val)
prediction = Prediction_block(X_val)
test_rmse_score, test_r2_score = accuracy(y_val,prediction)
print('ANN root mean squared error: ', test_rmse_score)
print('R2 score: ', test_r2_score )
pred_str = 'ANN_lr'+str(learning_rate)+'_beta'+str(beta1)+'-'+str(beta2)+'-'+str(beta3)+'-'+str(beta4)+'_hidden'+str(hidden_1)+'-'+str(hidden_2)+'-'+str(hidden_3)+'-'+str(hidden_4)
prediction_dict[pred_str] = prediction
if submit:
    submit_prediction = Prediction_block(test_processed)
    submit_prediction_dict[pred_str] = submit_prediction
# Data Save
if save_score:
    log_df = pd.read_csv("diffrent_training_results.csv")
    log_df = log_df.append({'learning_rate' : learning_rate, 'num_steps' : num_steps, 'beta1' : beta1, 'beta2' : beta2, 'beta3' : beta3, 'beta4' : beta4, 'hidden_1' : hidden_1 , 'hidden_2' : hidden_2, 'hidden_3' : hidden_3, 'hidden_4' : hidden_4, 'input_dim' : input_dim , 'test_rmse_score' : test_rmse_score , 'test_r2_score' : test_r2_score}, ignore_index=True)
    log_df.to_csv("diffrent_training_results.csv", encoding='utf-8',index=False)
#Following variables are only used to zoom into the graph
start_observation_flag = 4000
end_observation_flag = 12000
learning_curve(start_observation_flag,end_observation_flag)
plot_prediction( y_val, pred_df['ANN_lr0.1_beta0.1-0.05-0.0-0.0_hidden76-48-32-16'])
The variables changed in this section are:
| Layer | Neurons | Beta for L2 regularization |
|---|---|---|
| 1st hidden layer | 8 | 0.005 |
| 2nd hidden layer | 32 | 0.1 |
| 3rd hidden layer | 16 | 0.05 |
| 4th hidden layer | 8 | 0 |
tf.reset_default_graph()
learning_rate = 0.05
num_steps = 25000
#for regularize weight matrix
beta1 = 0.005
beta2 = 0.1
beta3 = 0.05
beta4 = 0.0
hidden_1 = 8
hidden_2 = 32
hidden_3 = 16
hidden_4 = 8
minimum_validation_loss = 0.02101000
#tf graph input
X_tf = tf.placeholder("float" )
y_tf = tf.placeholder("float" )
weight_bais()
miscellaneous_initialization()
train_LC = []
val_LC = []
training_block(X_train,y_train, X_val,y_val)
prediction = Prediction_block(X_val)
test_rmse_score, test_r2_score = accuracy(y_val,prediction)
print('ANN root mean squared error: ', test_rmse_score)
print('R2 score: ', test_r2_score )
# learning_curve(start_observation_flag,end_observation_flag)
pred_str = 'ANN_lr'+str(learning_rate)+'_beta'+str(beta1)+'-'+str(beta2)+'-'+str(beta3)+'-'+str(beta4)+'_hidden'+str(hidden_1)+'-'+str(hidden_2)+'-'+str(hidden_3)+'-'+str(hidden_4)
prediction_dict[pred_str] = prediction
if submit:
    submit_prediction = Prediction_block(test_processed)
    submit_prediction_dict[pred_str] = submit_prediction
# Data Save
if save_score:
    log_df = pd.read_csv("diffrent_training_results.csv")
    log_df = log_df.append({'learning_rate' : learning_rate, 'num_steps' : num_steps, 'beta1' : beta1, 'beta2' : beta2, 'beta3' : beta3, 'beta4' : beta4, 'hidden_1' : hidden_1 , 'hidden_2' : hidden_2, 'hidden_3' : hidden_3, 'hidden_4' : hidden_4, 'input_dim' : input_dim , 'test_rmse_score' : test_rmse_score , 'test_r2_score' : test_r2_score}, ignore_index=True)
    log_df.to_csv("diffrent_training_results.csv", encoding='utf-8',index=False)
#Following variables are only used to zoom into the graph
# max(0, ...) keeps the slice start from going negative when the minimum occurs early
start_observation_flag = max(0, train_LC.index( min(train_LC)) - 200)
end_observation_flag = train_LC.index( min(train_LC)) + 100
learning_curve(start_observation_flag,end_observation_flag)
plot_prediction( y_val, pred_df['ANN_lr0.05_beta0.005-0.1-0.05-0.0_hidden8-32-16-8'])
| Layer | Neurons | Beta for L2 regularization |
|---|---|---|
| 1st hidden layer | 16 | 0.1 |
| 2nd hidden layer | 8 | 0 |
| 3rd hidden layer | 4 | 0 |
| 4th hidden layer | 2 | 0 |
tf.reset_default_graph()
learning_rate = 0.05
num_steps = 15000
#for regularize weight matrix
beta1 = 0.1
beta2 = 0.0
beta3 = 0.0
beta4 = 0.0
hidden_1 = 16
hidden_2 = 8
hidden_3 = 4
hidden_4 = 2
minimum_validation_loss = 0.01901000
#tf graph input
X_tf = tf.placeholder("float" )
y_tf = tf.placeholder("float" )
weight_bais()
miscellaneous_initialization()
train_LC = []
val_LC = []
training_block(X_train,y_train, X_val,y_val)
prediction = Prediction_block(X_val)
test_rmse_score, test_r2_score = accuracy(y_val,prediction)
print('ANN root mean squared error: ', test_rmse_score)
print('R2 score: ', test_r2_score )
# learning_curve(start_observation_flag,end_observation_flag)
pred_str = 'ANN_lr'+str(learning_rate)+'_beta'+str(beta1)+'-'+str(beta2)+'-'+str(beta3)+'-'+str(beta4)+'_hidden'+str(hidden_1)+'-'+str(hidden_2)+'-'+str(hidden_3)+'-'+str(hidden_4)
prediction_dict[pred_str] = prediction
if submit:
    submit_prediction = Prediction_block(test_processed)
    submit_prediction_dict[pred_str] = submit_prediction
# Data Save
if save_score:
    log_df = pd.read_csv("diffrent_training_results.csv")
    log_df = log_df.append({'learning_rate' : learning_rate,
                            'num_steps' : num_steps, 'beta1' : beta1,
                            'beta2' : beta2, 'beta3' : beta3, 'beta4' : beta4,
                            'hidden_1' : hidden_1 , 'hidden_2' : hidden_2,
                            'hidden_3' : hidden_3, 'hidden_4' : hidden_4, 'input_dim' : input_dim ,
                            'test_rmse_score' : test_rmse_score ,
                            'test_r2_score' : test_r2_score}, ignore_index=True)
    log_df.to_csv("diffrent_training_results.csv", encoding='utf-8',index=False)
#Following variables are only used to zoom into the graph
# max(0, ...) keeps the slice start from going negative when the minimum occurs early
start_observation_flag = max(0, train_LC.index( min(train_LC)) - 100)
end_observation_flag = train_LC.index( min(train_LC)) + 100
learning_curve(start_observation_flag,end_observation_flag)
plot_prediction( y_val, pred_df['ANN_lr0.05_beta0.1-0.0-0.0-0.0_hidden16-8-4-2'])
with tf.Session() as sess:
    try:
        # Restore variables from disk.
        saver.restore(sess, "model/model.ckpt")
        saver.save(sess, "model/model_ext/model.ckpt")
        print("Model Saved for ensemble.")
    except:
        print("------------ available checkpoint is for different model --------------")
From the learning curves we can observe where overfitting occurs. Overfitting shows up when the training loss drops well below the validation loss, even if the validation loss is still falling. It is a sign that the network is learning patterns in the training set that do not apply to the validation set. In short, we can say:
Overfitting : training loss << validation loss
Underfitting : training loss >> validation loss
Just right : training loss ~ validation loss
According to this rule of thumb, for ANN 1, 2 and 3 the two learning curves (validation loss and training loss) lie almost exactly on top of one another, i.e. the validation loss and the training loss are almost the same, so we can say that these models are fitting just right. The validation scores of .11, .1081 and .1050 are also impressive compared to the other models.
tf.reset_default_graph()
def weight_bais():
    global weights, biases
    weights = {
        'w1': tf.Variable(tf.random_normal([input_dim, hidden_1])),
        'out': tf.Variable(tf.random_normal([hidden_1, output_dim]))
    }
    biases = {
        'b1': tf.Variable(tf.random_normal([hidden_1])),
        'out': tf.Variable(tf.random_normal([output_dim]))
    }

def ann_model(X_input):
    # Hidden layers
    layer_1 = tf.add(tf.matmul(X_input, weights['w1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    # Output layer
    layer_out = tf.matmul(layer_1, weights['out']) + biases['out']
    return layer_out

def miscellaneous_initialization():
    global model, loss, regularizer_1, regularizer_2, regularizer_3, regularizer_4, optimizer, init, saver
    # Model Construct
    model = ann_model(X_tf)
    # Mean Squared Error loss function
    loss = tf.losses.mean_squared_error(y_tf, model, reduction=tf.losses.Reduction.SUM_BY_NONZERO_WEIGHTS)
    # loss = tf.square(y_tf - model)
    regularizer_1 = tf.nn.l2_loss(weights['w1'])
    # loss = tf.reduce_mean(loss + beta1*regularizer_1 + beta2*regularizer_2 + beta3*regularizer_3)
    loss = tf.reduce_mean(loss + beta1*regularizer_1)
    # Adam optimizer will update weights and biases after each step
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
    # Initialize variables
    init = tf.global_variables_initializer()
    # Add ops to save and restore all the variables.
    saver = tf.train.Saver()
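To make it easier to see what this graph computes, here is a minimal NumPy sketch of the same forward pass and regularized loss. It is an illustration under assumptions, not the notebook's implementation: the function names `forward` and `regularized_loss`, the toy sizes and the random data are all hypothetical. It relies on the fact that `tf.nn.l2_loss(w)` computes `sum(w**2) / 2`.

```python
import numpy as np

def forward(X, w1, b1, w_out, b_out):
    # One hidden layer with ReLU, followed by a linear output layer
    hidden = np.maximum(X @ w1 + b1, 0.0)
    return hidden @ w_out + b_out

def regularized_loss(y_true, y_pred, w1, beta1):
    # Mean squared error plus beta1 times the L2 penalty on the first-layer weights
    mse = np.mean((y_true - y_pred) ** 2)
    l2_penalty = 0.5 * np.sum(w1 ** 2)   # same quantity as tf.nn.l2_loss(w1)
    return mse + beta1 * l2_penalty

# Toy sizes and random data, for illustration only
rng = np.random.default_rng(0)
n_samples, n_features, n_hidden = 5, 3, 4
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=(n_samples, 1))
w1 = rng.normal(size=(n_features, n_hidden))
b1 = rng.normal(size=n_hidden)
w_out = rng.normal(size=(n_hidden, 1))
b_out = rng.normal(size=1)

y_pred = forward(X, w1, b1, w_out, b_out)
print(regularized_loss(y, y_pred, w1, beta1=0.1))
```

With `beta1 = 0` the penalty term vanishes and the loss reduces to the plain MSE, which is the setting used for the 4-neuron and 2-neuron models.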
| Layer name | Neurons | Value of beta for L2 regularization |
|---|---|---|
| 1st hidden layer | 16 | 0.1 |
tf.reset_default_graph()
learning_rate = 0.1
num_steps = 15000
#for regularize weight matrix
beta1 = 0.1
beta2 = None
beta3 = None
beta4 = None
minimum_validation_loss = 0.01901000
hidden_1 = 16
hidden_2 = None
hidden_3 = None
hidden_4 = None
#tf graph input
X_tf = tf.placeholder("float" )
y_tf = tf.placeholder("float" )
weight_bais()
miscellaneous_initialization()
train_LC = []
val_LC = []
training_block(X_train,y_train, X_val,y_val)
prediction = Prediction_block(X_val)
test_rmse_score, test_r2_score = accuracy(y_val,prediction)
print('ann root mean squared error: ', test_rmse_score)
print('accuracy score: ', test_r2_score )
# learning_curve(start_observation_flag,end_observation_flag)
pred_str = 'ANN_lr'+str(learning_rate)+'_beta'+str(beta1)+'-'+str(beta2)+'-'+str(beta3)+'-'+str(beta4)+'_hidden'+str(hidden_1)+'-'+str(hidden_2)+'-'+str(hidden_3)+'-'+str(hidden_4)
prediction_dict[pred_str] = prediction
if submit:
    submit_prediction = Prediction_block(test_processed)
    submit_prediction_dict[pred_str] = submit_prediction
# Data Save
if save_score:
    log_df = pd.read_csv("diffrent_training_results.csv")
    log_df = log_df.append({'learning_rate' : learning_rate, 'num_steps' : num_steps, 'beta1' : beta1, 'beta2' : beta2, 'beta3' : beta3, 'beta4' : beta4, 'hidden_1' : hidden_1 , 'hidden_2' : hidden_2, 'hidden_3' : hidden_3, 'hidden_4' : hidden_4, 'input_dim' : input_dim , 'test_rmse_score' : test_rmse_score , 'test_r2_score' : test_r2_score}, ignore_index=True)
    log_df.to_csv("diffrent_training_results.csv", encoding='utf-8',index=False)
#Following variables are only used to zoom into the graph
start_observation_flag = train_LC.index( min(train_LC)) - 50
end_observation_flag = train_LC.index( min(train_LC)) + 100
learning_curve(start_observation_flag,end_observation_flag)
plot_prediction( y_val, pred_df['ANN_lr0.1_beta0.1-None-None-None_hidden16-None-None-None'])
| Layer name | Neurons | Value of beta for L2 regularization |
|---|---|---|
| 1st hidden layer | 4 | 0 |
tf.reset_default_graph()
learning_rate = 0.1
num_steps = 8000
#for regularize weight matrix
beta1 = 0
beta2 = None
beta3 = None
beta4 = None
hidden_1 = 4
hidden_2 = None
hidden_3 = None
hidden_4 = None
minimum_validation_loss = 0.1701000
#tf graph input
X_tf = tf.placeholder("float" )
y_tf = tf.placeholder("float" )
weight_bais()
miscellaneous_initialization()
train_LC = []
val_LC = []
training_block(X_train,y_train, X_val,y_val)
prediction = Prediction_block(X_val)
test_rmse_score, test_r2_score = accuracy(y_val,prediction)
print('ann root mean squared error: ', test_rmse_score)
print('accuracy score: ', test_r2_score )
# learning_curve(start_observation_flag,end_observation_flag)
pred_str = 'ANN_lr'+str(learning_rate)+'_beta'+str(beta1)+'-'+str(beta2)+'-'+str(beta3)+'-'+str(beta4)+'_hidden'+str(hidden_1)+'-'+str(hidden_2)+'-'+str(hidden_3)+'-'+str(hidden_4)
prediction_dict[pred_str] = prediction
if submit:
    submit_prediction = Prediction_block(test_processed)
    submit_prediction_dict[pred_str] = submit_prediction
# Data Save
if save_score:
    log_df = pd.read_csv("diffrent_training_results.csv")
    log_df = log_df.append({'learning_rate' : learning_rate, 'num_steps' : num_steps, 'beta1' : beta1, 'beta2' : beta2, 'beta3' : beta3, 'beta4' : beta4, 'hidden_1' : hidden_1 , 'hidden_2' : hidden_2, 'hidden_3' : hidden_3, 'hidden_4' : hidden_4, 'input_dim' : input_dim , 'test_rmse_score' : test_rmse_score , 'test_r2_score' : test_r2_score}, ignore_index=True)
    log_df.to_csv("diffrent_training_results.csv", encoding='utf-8',index=False)
#Following variables are only used to zoom into the graph
start_observation_flag = train_LC.index( min(train_LC)) - 200
end_observation_flag = train_LC.index( min(train_LC)) + 100
learning_curve(start_observation_flag,end_observation_flag)
plot_prediction( y_val, pred_df['ANN_lr0.1_beta0-None-None-None_hidden4-None-None-None'])
| Layer name | Neurons | Value of beta for L2 regularization |
|---|---|---|
| 1st hidden layer | 2 | 0 |
tf.reset_default_graph()
learning_rate = 0.1
num_steps = 15000
#for regularize weight matrix
beta1 = 0
beta2 = None
beta3 = None
beta4 = None
hidden_1 = 2
hidden_2 = None
hidden_3 = None
hidden_4 = None
minimum_validation_loss = 0.01901000
#tf graph input
X_tf = tf.placeholder("float" )
y_tf = tf.placeholder("float" )
weight_bais()
miscellaneous_initialization()
train_LC = []
val_LC = []
training_block(X_train,y_train, X_val,y_val)
prediction = Prediction_block(X_val)
test_rmse_score, test_r2_score = accuracy(y_val,prediction)
print('ann root mean squared error: ', test_rmse_score)
print('accuracy score: ', test_r2_score )
learning_curve(start_observation_flag,end_observation_flag)
pred_str = 'ANN_lr'+str(learning_rate)+'_beta'+str(beta1)+'-'+str(beta2)+'-'+str(beta3)+'-'+str(beta4)+'_hidden'+str(hidden_1)+'-'+str(hidden_2)+'-'+str(hidden_3)+'-'+str(hidden_4)
prediction_dict[pred_str] = prediction
if submit:
    submit_prediction = Prediction_block(test_processed)
    submit_prediction_dict[pred_str] = submit_prediction
# Data Save
if save_score:
    log_df = pd.read_csv("diffrent_training_results.csv")
    log_df = log_df.append({'learning_rate' : learning_rate, 'num_steps' : num_steps, 'beta1' : beta1, 'beta2' : beta2, 'beta3' : beta3, 'beta4' : beta4, 'hidden_1' : hidden_1 , 'hidden_2' : hidden_2, 'hidden_3' : hidden_3, 'hidden_4' : hidden_4, 'input_dim' : input_dim , 'test_rmse_score' : test_rmse_score , 'test_r2_score' : test_r2_score}, ignore_index=True)
    log_df.to_csv("diffrent_training_results.csv", encoding='utf-8',index=False)
plot_prediction( y_val, pred_df['ANN_lr0.1_beta0-None-None-None_hidden2-None-None-None'])
We can observe where overfitting occurs. Overfitting shows up when the training loss drops below the validation loss, even while the validation loss is still falling. It is a sign that the network is learning patterns in the training set that do not apply to the validation set. In short, we can say:
Overfitting : training loss << validation loss
Underfitting : training loss >> validation loss
Just right : training loss ~ validation loss
According to this rule of thumb, for ANN 4 both learning curves (validation loss and training loss) lie almost exactly on top of one another. Since the validation loss and training loss are almost the same, we can say that this model is fitting just right. The validation score of .1059 is also impressive compared to the other models.
For ANN 5 and 6, however, training loss << validation loss, so we can say that these two models overfit the data. I attribute this to their smaller number of neurons: ANN 4 has just the right number of neurons, which is why the overfitting occurs in ANN 5 and 6 even with similar parameters.
Sometimes both curves appear to lie on top of each other. The reason is:
For the loss function I have used Mean Squared Error (MSE), reduced with SUM_BY_NONZERO_WEIGHTS, which divides the scalar sum by the number of non-zero weights. MSE calculates the squared error for every data point and then takes the mean. All my SalePrice values are small after normalization (between 10 and 13.5, with a mean of 12.02), so the squared errors are small too. Suppose that in the nth epoch the squared errors of five samples are as follows.
For training loss:
MSE = (.25+1+.25+.25+1.7)/5 = .69
For validation loss:
MSE = (.81+1.69+1.44+.49+.04)/5 = .894
The difference between validation loss and training loss is .204.
Usually in a regression problem a neural network starts to predict the average value within 5-20 epochs, so the difference between val_loss and training_loss shrinks very quickly. In our dummy example the difference is already .204; if this is epoch 10, then by the time training reaches epoch 500 the difference could be as low as 10^-4.
A few of my hyperparameter tuning runs are shown in the following block. In this data, a hidden layer value of 0 means that the hidden layer is turned off. For example, hidden_3 = 0 means that hidden layer 3 is removed and the model has only 2 hidden layers. All scores are computed on a validation set that the model does not see during training; in most cases it was an 80-20 split. I did not keep any cross-validation results in the following table, but I used different seeds while splitting the data, and because of the different seeds even good hyperparameters sometimes gave only so-so accuracy.
import pandas as pd
log_df = pd.read_csv("diffrent_training_results.csv")
# print(log_df.to_string())
pd.set_option('display.max_rows', None)
log_df
In the table above, index 44 shows that with a learning rate of .001 the model does not predict anything useful, so I changed it gradually and finally found that learning rates of .1 and .05 provide the best results.
Beta1, Beta2, Beta3 and Beta4 are the regularization parameters for hidden layers 1, 2, 3 and 4. In some rows of the table hidden layer 2, 3 or 4 is 0 or NaN while beta 2, 3 or 4 still has a value. In that case the layer is actually off, so those beta values mean nothing.
For the 3-layer model, when beta1, beta2 and beta3 are .005, the model shows a significant improvement with a learning rate of .1 or .05. But with a learning rate of .1 and beta1 = .1, beta2 = 0, beta3 = 0 the model performs even better most of the time, and it also takes fewer epochs to reach the best validation accuracy.
From index 63 to 69 I tried 200, 100 and 30 neurons, because the data has 403 features and it is common practice to use about half as many neurons as features in the first hidden layer. This strategy did not work well enough, although with my selected parameters it improved a little. I chose the 16-8-4 combination of neurons following the same halving practice. In our case 16 neurons in the first layer gave better accuracy, and adding 8 and 4 neurons in the next two layers improved the stability of the model. It now gives good validation accuracy after 2000 epochs, and the best validation accuracy falls in the epoch ranges 2000-2500, 3300-3600 or 5000-5400.
From index 70 to 78 we can see that a single hidden layer with a single neuron performs well, in line with the plan stated in the target section. I then increased the number of neurons; the learning curves for these models are in the following block. The y axis shows RMSE and the x axis shows i, where i*50 is the epoch number. The blue curve is the training curve and the green curve is the validation curve.
At index 169 and 155 in the table the model and its parameters are exactly the same, but one run gives .123 and the other gives .41. This shows how inconsistent the model becomes when we increase the number of neurons in the first hidden layer.
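One way to make this run-to-run inconsistency visible is to group repeated runs by their hyperparameters and look at the spread of the score. The sketch below uses a toy DataFrame rather than the real log: the `hidden_1` sizes are hypothetical, the scores .123 and .41 are the two quoted above, and the remaining two scores are illustrative placeholders. Only the column names follow the log file.

```python
import pandas as pd

# Toy stand-in for the training log (not the real diffrent_training_results.csv)
runs = pd.DataFrame({
    "hidden_1": [200, 200, 16, 16],                    # hypothetical layer sizes
    "learning_rate": [0.1, 0.1, 0.1, 0.1],
    "test_rmse_score": [0.123, 0.41, 0.110, 0.108],    # last two values are illustrative
})

# Mean, spread and number of runs per configuration, most stable first
summary = (
    runs.groupby(["hidden_1", "learning_rate"])["test_rmse_score"]
        .agg(["mean", "std", "count"])
        .sort_values("std")
)
print(summary)
```

A configuration with a large `std` relative to its `mean` is one whose single-run score should not be trusted on its own.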
We can see that even after adding another layer the ANN does not perform well when we increase the neurons in the first layer. The reason is that this type of regression problem usually does well with linear regression. Increasing the number of neurons does not bring much improvement; what we need is a small number of properly regularized neurons.




I am using a bagging-style averaging method for this section. Usually in this technique we add the results of different models and average them. Instead of a plain average, I take a different fraction from each model's result, making sure that the fractions sum to 1.
I have tried different combinations of ensemble learning to improve performance. Kaggle limits the number of submission files that can be uploaded. So before submitting to Kaggle I made an 80-20 split, made predictions on the 20% held-out data, and tried the ensembles on it, so that before submission I could confirm which combination might work well.
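The weighted combination described above can be sketched as a small helper that also checks that the fractions sum to 1. The function name `weighted_ensemble`, the column names and the toy predictions are assumptions for illustration; the notebook itself multiplies selected `pred_df` columns by a list of fractions directly.

```python
import numpy as np
import pandas as pd

def weighted_ensemble(pred_df, columns, weights):
    """Blend the chosen prediction columns with fractions that sum to 1."""
    weights = np.asarray(weights, dtype=float)
    if len(columns) != len(weights):
        raise ValueError("need exactly one weight per column")
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    blended = pred_df[columns].to_numpy() @ weights
    return pd.Series(blended, index=pred_df.index)

# Toy predictions on the log-price scale (illustrative values only)
toy_pred_df = pd.DataFrame({
    "Lasso": [12.0, 11.5],
    "Xgboost": [12.2, 11.7],
    "ANN": [11.9, 11.6],
})
blend = weighted_ensemble(toy_pred_df, ["Lasso", "Xgboost", "ANN"], [0.4, 0.2, 0.4])
print(blend)
```

Checking the sum up front guards against a typo in the fractions silently shifting every blended prediction up or down.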
The following 3 sections arrange the different prediction results for ensembling.
x = ['Random Forest Regressor',
'DecisionTree','Xgboost','Lasso',
'ANN_base_lr0.1_beta0.1-0.0-0.0-None_hidden16-8-4-None',
'ANN_lr0.1_beta0.1-0.05-0.0-0.0_hidden76-48-32-16',
'ANN_lr0.05_beta0.005-0.1-0.05-0.0_hidden8-32-16-8',
'ANN_lr0.05_beta0.1-0.0-0.0-0.0_hidden16-8-4-2',
'ANN_lr0.1_beta0.1-None-None-None_hidden16-None-None-None',
'ANN_lr0.1_beta0-None-None-None_hidden4-None-None-None',
'ANN_lr0.1_beta0-None-None-None_hidden2-None-None-None']
d = dict()
for k in x:
    if not submit:
        d[k] = prediction_dict[k]
    if submit:
        d[k] = submit_prediction_dict[k]
if not submit:
    prediction_dict = d
if submit:
    submit_prediction_dict = d
if submit:
    pred_df = pd.read_csv("diffrent_pred_results.csv")
else:
    pred_df = pd.read_csv("pred_results.csv")
if not submit:
    pd.set_option('display.max_colwidth', -1)
    pred_df = pd.DataFrame(prediction_dict)
    pred_df.to_csv("pred_results.csv", encoding='utf-8',index=False)
else:
    pd.set_option('display.max_colwidth', -1)
    pred_df = pd.DataFrame(submit_prediction_dict)
    pred_df.to_csv("diffrent_pred_results.csv", encoding='utf-8',index=False)
pd.DataFrame(pred_df.columns)
| Name | learning rate | beta 1 | beta 2 | beta 3 | beta 4 | hidden layer 1 | hidden layer 2 | hidden layer 3 | hidden layer 4 |
|---|---|---|---|---|---|---|---|---|---|
| ANN_base_lr0.1_beta0.1-0.0-0.0-None_hidden16-8-4-None | 0.1 | 0.1 | 0.0 | 0.0 | None | 16 | 8 | 4 | None |
| ANN_lr0.05_beta0.005-0.1-0.05-0.0_hidden8-32-16-8 | 0.05 | 0.005 | 0.1 | 0.05 | 0.0 | 8 | 32 | 16 | 8 |
| ANN_lr0.05_beta0.1-0.0-0.0-0.0_hidden16-8-4-2 | 0.05 | 0.1 | 0.0 | 0.0 | 0.0 | 16 | 8 | 4 | 2 |
| ANN_lr0.1_beta0-None-None-None_hidden2-None-None-None | 0.1 | 0 | None | None | None | 2 | None | None | None |
| ANN_lr0.1_beta0.1-0.05-0.0-0.0_hidden76-48-32-16 | 0.1 | 0.1 | 0.05 | 0.0 | 0.0 | 76 | 48 | 32 | 16 |
| ANN_lr0.1_beta0.1-None-None-None_hidden16-None-None-None | 0.1 | 0.1 | None | None | None | 16 | None | None | None |
| ANN_lr0.1_beta0-None-None-None_hidden4-None-None-None | 0.1 | 0 | None | None | None | 4 | None | None | None |
# pred_df[pred_df.columns[[1,3,5]]] * [1,2,30]
print('Using ' , pred_df.columns[[4,3,2]].values)
prediction = pred_df[pred_df.columns[[4,3,2]]] * [.4,.2,.4]
prediction = prediction.sum(axis = 1)
if not submit:
    test_rmse_score, test_r2_score = accuracy(y_val, prediction)
    print('ann root mean squared error: ', test_rmse_score)
    print('accuracy score: ', test_r2_score )

prediction = pred_df[pred_df.columns[[4,3,2]]] * [.4,.3,.3]
prediction = prediction.sum(axis = 1)
if not submit:
    test_rmse_score, test_r2_score = accuracy(y_val, prediction)
    print('ann root mean squared error: ', test_rmse_score)
    print('accuracy score: ', test_r2_score )
# pred_df[pred_df.columns[[1,3,5]]] * [1,2,30]
print('Using ' , pred_df.columns[[2,4,5,6,7]].values)
prediction = pred_df[pred_df.columns[[2,4,5,6,7]]] * [.25,.2,.2 ,.15 , .2]
prediction = prediction.sum(axis = 1)
if not submit:
    test_rmse_score, test_r2_score = accuracy(y_val, prediction)
    print('ann root mean squared error: ', test_rmse_score)
    print('accuracy score: ', test_r2_score )

print('Using ' , pred_df.columns[[0,4,5,6,7]].values)
prediction = pred_df[pred_df.columns[[0,4,5,6,7]]] * [.25,.2,.2 ,.15 , .2]
prediction = prediction.sum(axis = 1)
if not submit:
    test_rmse_score, test_r2_score = accuracy(y_val, prediction)
    print('ann root mean squared error: ', test_rmse_score)
    print('accuracy score: ', test_r2_score )

print('Using ' , pred_df.columns[[4,5,6,7,0,2,3]].values)
prediction = pred_df[pred_df.columns[[4,5,6,7,0,2,3]]] * [.15,.1,.1,.05,.0,.2,.4]
prediction = prediction.sum(axis = 1)
if not submit:
    test_rmse_score, test_r2_score = accuracy(y_val, prediction)
    print('ann root mean squared error: ', test_rmse_score)
    print('accuracy score: ', test_r2_score )

Ensemble combination 4 gives a score of 0.12192. Currently combination 4 shows an RMSE of .1004, while combination 1 shows a slightly worse value of 0.1007. The reason behind the difference is that the ANN does not perform exactly the same on every run. That means that if I submitted combination 1 now, I might get a similar result.
The models used in Combination 4, with their parameters:
| Name | learning rate | beta 1 | beta 2 | beta 3 | beta 4 | hidden layer 1 | hidden layer 2 | hidden layer 3 | hidden layer 4 | Fraction taken |
|---|---|---|---|---|---|---|---|---|---|---|
| ANN_base_lr0.1_beta0.1-0.0-0.0-None_hidden16-8-4-None | 0.1 | 0.1 | 0.0 | 0.0 | None | 16 | 8 | 4 | None | 0.15 |
| ANN_lr0.05_beta0.005-0.1-0.05-0.0_hidden8-32-16-8 | 0.05 | 0.005 | 0.1 | 0.05 | 0.0 | 8 | 32 | 16 | 8 | 0.1 |
| ANN_lr0.05_beta0.1-0.0-0.0-0.0_hidden16-8-4-2 | 0.05 | 0.1 | 0.0 | 0.0 | 0.0 | 16 | 8 | 4 | 2 | 0.05 |
| ANN_lr0.1_beta0.1-0.05-0.0-0.0_hidden76-48-32-16 | 0.1 | 0.1 | 0.05 | 0.0 | 0.0 | 76 | 48 | 32 | 16 | 0.1 |
| Xgboost | 0.05 | Not applicable | Not applicable | Not applicable | Not applicable | Not applicable | Not applicable | Not applicable | Not applicable | 0.2 |
| Lasso | alpha = 5e-4 | Not applicable | Not applicable | Not applicable | Not applicable | Not applicable | Not applicable | Not applicable | Not applicable | 0.4 |
In the learning curve graph, if the minima of the training and validation curves are close to each other, then the model is good to use. If the training minimum and the validation minimum are nowhere near each other, using the model does not help much in most cases. When both are close, we can take the epoch number of the minimum training loss as the epoch number of the minimum validation loss, and then train on the whole dataset without depending on a fixed epoch number. The model does not give the same result at the same epoch every time, which is the main reason for removing the epoch dependency.
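The check described above can be sketched as follows. The helper name `minima_are_close`, the tolerance `max_gap` and the toy curves are assumptions for illustration; in the notebook the curves are the lists `train_LC` and `val_LC`, with one recorded point per 50 epochs.

```python
import numpy as np

def minima_are_close(train_curve, val_curve, max_gap=5):
    """Return the positions of both minima and whether they are within max_gap points."""
    train_idx = int(np.argmin(train_curve))
    val_idx = int(np.argmin(val_curve))
    return train_idx, val_idx, abs(train_idx - val_idx) <= max_gap

# Toy learning curves (illustrative values only)
toy_train = [0.50, 0.30, 0.20, 0.15, 0.16]
toy_val_close = [0.60, 0.35, 0.22, 0.18, 0.19]
toy_val_far = [0.60, 0.20, 0.30, 0.40, 0.50]

print(minima_are_close(toy_train, toy_val_close, max_gap=1))
print(minima_are_close(toy_train, toy_val_far, max_gap=1))
```

When the function reports that the minima are close, the position of the training minimum can stand in for the validation minimum when retraining on the full dataset.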
To use this section, please uncomment the last line of the split data section and comment out the accuracy section.
pd.DataFrame(pred_df.columns)
use_ensemble = True
#if ensemble =false then chose a model
choose_model = 6
#if want to use given test data
if submit:
    # X_val = test_processed
    if not use_ensemble:
        prediction = pred_df[pred_df.columns[[choose_model]]]
    prediction = np.exp(prediction.values)
    pred_out_df = pd.DataFrame(prediction, index=test["Id"], columns=["SalePrice"])
    pred_out_df.to_csv('output.csv', header=True, index_label='Id')
My target in this report was to improve the ANN model and show how well an ANN can perform on this problem. At the beginning of the report I built an ANN model that performs better than or similar to the other ANN models shown in the report. I performed cross-validation on that model, and it scored 0.11912 on Kaggle. Then I showed some other models that perform well but cannot beat the score of 0.11912, and explained why some models work well with certain parameters. After that I showed a table listing the performance of different models and added my analysis and observations. Finally I showed four ensemble combinations with their Kaggle scores. The 2nd ensemble combination gave the best Kaggle score, 0.11706. This score was achieved by combining 4 ANN models and XGBoost; I used the first 4 ANN models for this ensemble.

https://www.kaggle.com/dansbecker/xgboost
https://medium.com/@gabrieltseng/gradient-boosting-and-xgboost-c306c1bcfaf5
https://www.kaggle.com/janiobachmann/predicting-house-prices-regression-techniques
https://www.kaggle.com/dansbecker/selecting-and-filtering-in-pandas
https://www.kaggle.com/dansbecker/handling-missing-values
https://medium.com/airbnb-engineering/designing-machine-learning-models-7d0048249e69
https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f
https://medium.com/@rrfd/standardize-or-normalize-examples-in-python-e3f174b65dfc
https://www.saedsayad.com/decision_tree_reg.htm
https://www.kaggle.com/apapiu/house-prices-advanced-regression-techniques/regularized-linear-models
https://www.kaggle.com/juliencs/a-study-on-regression-applied-to-the-ames-dataset
I was inspired by Ian Goodfellow's book and followed its way of explanation to explain my choices. The book can be found here: https://www.deeplearningbook.org/
I have also followed DataFlair for definitions; their lessons can be found here: https://data-flair.training/blogs/neural-network-for-machine-learning/